JP2001022709A

JP2001022709A - Cluster system and computer-readable storage medium storing program

Info

Publication number: JP2001022709A
Application number: JP11198971A
Authority: JP
Inventors: Akifumi Murata; 明文村田; Makoto Koishi; 誠小石
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1999-07-13
Filing date: 1999-07-13
Publication date: 2001-01-26

Abstract

PROBLEM TO BE SOLVED: To make it easy to introduce a new program into a cluster system and to keep executing programs even if some abnormality occurs. SOLUTION: In a cluster system that monitors the operating status of programs 6a to 6c running on computers 2a and 2b, an identification information acquiring means 11a acquires identification information of the programs 6a to 6c, and a monitoring means 12a monitors whether the programs 6a to 6c indicated by the acquired identification information are normal. When a monitored program is judged abnormal, a restarting means 12b restarts it on the computer 2a on which it was running. If an abnormality is then detected in the restarted program, a program transfer means 11b causes the programs 6a to 6c that were running on the computer 2a to be executed on the other computer 2b.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of the Invention] The present invention relates to a cluster system that monitors the operation of each computer in an environment in which a plurality of computers are connected, and to a computer-readable storage medium storing a program.

[0002]

[Description of the Related Art] A cluster system is built by installing cluster software on computers connected via a network.

[0003] FIG. 5 is a block diagram schematically illustrating a conventional cluster system. In this cluster system 1, two computers 2a and 2b are connected by a communication line 3, and an operating system (hereinafter "OS") 4 and cluster software 5 run on each of the computers 2a and 2b.

[0004] Each of the computers 2a and 2b constituting the cluster system 1 can execute application programs (hereinafter "applications") such as a database management program, an e-mail management program, a directory service program, and a communication program. FIG. 5 shows, as an example, the case where applications 6a to 6c are running on the computer 2a.

[0005] When such applications 6a to 6c are introduced into the cluster system 1, monitoring programs (monitoring modules) 7a to 7c dedicated to the respective applications 6a to 6c must be added to the cluster software 5.

[0006] While the respective applications 6a to 6c are running, the dedicated monitoring programs 7a to 7c periodically check their execution state using commands of the OS 4.

[0007] When, as a result of this check using the commands of the OS 4, the monitoring programs 7a to 7c detect an operational abnormality in the monitored applications 6a to 6c or an abnormality in the computer 2a executing them, they restart the monitored applications 6a to 6c.

[0008] Further, if this restart fails, the monitoring programs 7a to 7c start the monitored applications 6a to 6c on the other computer 2b in accordance with an instruction from the operator.

[0009] That is, the cluster software 5 starts the applications 6a to 6c designated by the operator, periodically monitors whether the computers 2a and 2b are each normal, and monitors whether the applications 6a to 6c running on the computers 2a and 2b are normal. When a failure is detected as a result of this monitoring, it attempts to restart the application related to the failure, and if a failure occurs again after the restart, it notifies the operator to that effect. Thereafter, upon an instruction from the operator, the cluster software 5 has the other computer 2b take over the data and programs on the failed computer 2a.

[0010] The operation of the conventional cluster system 1 described above will now be explained. The applications 6a to 6c running on the conventional cluster system 1 may stop operating normally, or may disappear from the computer 2a, because of problems in the applications 6a to 6c themselves, problems in the OS 4 or the hardware, operator errors, and the like.

[0011] To guard against such cases, the operator of the computer 2a and the monitoring programs 7a to 7c monitor the operating status of the applications 6a to 6c or of the computer 2a by executing monitoring commands provided by the OS 4 and by referring to log messages.

[0012] If this monitoring detects a problem, the operator or the monitoring programs 7a to 7c restart the application judged abnormal, or reboot the computer 2a, depending on the type of problem. Alternatively, in the cluster system 1 in which the computers 2a and 2b are loosely coupled, the operator issues an instruction and the failed application is taken over on the other computer 2b, which is different from the failed computer 2a.

[0013]

[Problems to be Solved by the Invention] In the conventional cluster system 1 illustrated in FIG. 5, when an abnormality occurs in any of the running applications 6a to 6c, the operator or the monitoring programs 7a to 7c restart the affected application on the computer 2a according to the nature of the abnormality. If the abnormality still occurs, the affected application is taken over by the other-system computer 2b.

[0014] However, takeover of the affected application by the other-system computer 2b is carried out at the operator's discretion, and the operator may overlook an error message concerning the applications 6a to 6c or an error message generated by the monitoring programs 7a to 7c.

[0015] In that case, takeover by the other-system computer 2b is delayed, which may hinder the operation of the cluster system 1 and may in turn reduce its reliability and availability.

[0016] In addition, in the conventional cluster system 1, the operator must incorporate the monitoring programs 7a to 7c for the applications 6a to 6c being introduced into the cluster software 5, so introducing an application is laborious for the operator.

[0017] The present invention has been made in view of the above circumstances. Its object is to provide a cluster system into which a new program can easily be introduced and which can continue executing programs even if some abnormality occurs, and a computer-readable storage medium storing a program.

[0018]

[Means for Solving the Problems] The gist of the present invention is that it provides means for collectively monitoring the programs running on the computers constituting a cluster system, independently of the type of program. In addition, when an abnormality is detected again in a program that was restarted after an abnormality was detected, all movable programs among those that were running on the computer executing that program are started on another computer.

[0019] The specific means adopted to realize the present invention are described below.

[0020] The first invention relates to a cluster system that monitors the operating status of programs running in an environment in which a plurality of computers are connected.

[0021] The cluster system of the first invention comprises: identification information acquiring means for acquiring identification information of the programs running in this environment; monitoring means for monitoring whether the programs indicated by the identification information acquired by the identification information acquiring means are normal; restarting means for, when the monitoring by the monitoring means detects an abnormality in a program, restarting that program on the computer that was executing it; and program transfer means for, when an abnormality is detected in the program restarted by the restarting means, causing the programs that were running on the computer that was executing that program to be executed on another computer.

[0022] That is, in the cluster system of the first invention, the identification information of the programs in operation is collected, and the running programs are automatically monitored together.

[0023] Therefore, even when a new program is introduced, there is no need to introduce a monitoring program dedicated to it.

[0024] Hence, a new program can easily be introduced without specially preparing a dedicated monitoring program.

[0025] Further, in the cluster system of the first invention, if an abnormality occurs again even after the abnormal program has been restarted, the programs that were running on the computer where the abnormality occurs are automatically started on another computer.

[0026] Therefore, a delay in program takeover caused by the operator failing to notice the abnormality can be prevented, and high reliability and availability can be ensured.

[0027] The second invention relates to a cluster system that monitors the operating status of application programs that run in an environment in which a plurality of computers are connected and that have an application interface.

[0028] The cluster system of the second invention comprises: identification information acquiring means for acquiring identification information of the application programs running in this environment; monitoring means for calling the application interface of each application program indicated by the identification information acquired by the identification information acquiring means and monitoring whether it is executing normally; restarting means for, when the monitoring by the monitoring means detects an abnormality in an application program, restarting that application program on the computer that was executing it; and program transfer means for, when an abnormality is detected in the application program restarted by the restarting means, causing the programs that were running on the computer that was executing that application program to be executed on another computer.

[0029] In the second invention, the identification information of the running applications is managed, and the application interface included in each running application is called as appropriate. If, as a result of the call, there is for example no response to the call, or an error is received as the response, the application is judged abnormal.
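The judgment rule in this paragraph (no response, or an error response, means the application is abnormal) can be sketched as follows. This is only an illustrative sketch: the function name `check_via_api`, the `probe` callable standing in for a call to the application interface, and the timeout value are all assumptions that do not appear in the specification.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable


def check_via_api(probe: Callable[[], object], timeout_s: float = 1.0) -> bool:
    """Return True if the application looks normal, False if it is judged abnormal.

    `probe` is a hypothetical stand-in for one call to the application's interface.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(probe)
        try:
            future.result(timeout=timeout_s)
        except FutureTimeout:
            return False  # no response within the time limit
        except Exception:
            return False  # the call answered with an error
        return True
    finally:
        # Do not block the monitor on a probe that is still hanging.
        pool.shutdown(wait=False)
```

A monitor built this way needs only one probe per application rather than a dedicated monitoring program, which is the point the paragraph makes.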

[0030] Therefore, since no monitoring program dedicated to each application is needed, the same operation and effects as in the first invention are obtained.

[0031] Further, in the cluster system of the second invention, if an abnormality occurs again even after the abnormal application program has been restarted, the program is started on another computer.

[0032] Therefore, as in the first invention, a delay in program takeover caused by the operator failing to notice the abnormality can be prevented, and high reliability and availability can be ensured.

[0033] The third invention is a computer-readable storage medium storing a program that causes a computer to realize: an identification information acquiring function of acquiring identification information of programs running in an environment in which a plurality of computers are connected; a monitoring function of monitoring whether the programs indicated by the identification information acquired by the identification information acquiring function are normal; a restart function of, when the monitoring by the monitoring function detects an abnormality in a program, restarting that program on the computer that was executing it; and a program transfer function of, when an abnormality is detected in the program restarted by the restart function, causing the programs that were running on the computer that was executing that program to be executed on another computer.

[0034] The fourth invention is a computer-readable storage medium storing a program that causes a computer to realize: an identification information acquiring function of acquiring identification information of application programs that run in an environment in which a plurality of computers are connected and that have an application interface; a monitoring function of calling the application interface of each application program indicated by the identification information acquired by the identification information acquiring function and monitoring whether it is executing normally; a restart function of, when the monitoring by the monitoring function detects an abnormality in an application program, restarting that application program on the computer that was executing it; and a program transfer function of, when an abnormality is detected in the application program restarted by the restart function, causing the programs that were running on the computer that was executing that application program to be executed on another computer.

[0035] The third and fourth inventions are computer-readable storage media storing programs for realizing, by a computer, the functions of the cluster systems described in the first and second inventions, respectively.

[0036] By using a storage medium storing such a program, the functions described above can easily be added to a computer or computer system that lacks them.

[0037]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the drawings.

[0038] (First Embodiment) This embodiment describes a cluster system that monitors running programs collectively, restarts a program if some abnormality occurs, and, if the abnormality still occurs, automatically has the other-system computer take over the programs.

[0039] FIG. 1 is a block diagram schematically illustrating the cluster system according to this embodiment. Parts identical to those in FIG. 5 are given the same reference numerals and their description is omitted or kept brief; only the differences are described in detail here.

[0040] One computer 2a of the cluster system 8 is the computer that normally runs the applications. The other computer 2b operates as a standby system and takes over the work of the computer 2a if an abnormality occurs in it. The communication line 3 connects the computers 2a and 2b so that they can exchange data.

[0041] A storage area 9 is provided in each of the computers 2a and 2b, but the storage area on the computer 2b side is omitted from FIG. 1.

[0042] The OS 4 is software that controls each of the computers 2a and 2b, and the applications 6a to 6c are programs for processing various kinds of work. Here, each of the applications 6a to 6c is assumed to be a program used in a form resident on a computer, and to be a program that can be moved from one computer to another.

[0043] The cluster software 10 consists mainly of a cluster software main body 11 and a monitoring program 12, and is installed on both computers 2a and 2b. The cluster software 10 on the computer 2a side is described below as an example, but the cluster software 10 on the computer 2b side has the same functions and operates in the same way.

[0044] The cluster software main body 11 retains the same functions as the conventional cluster software 5 described earlier, and includes an identification information acquisition function 11a, a program transfer function 11b, and a computer stop function 11c.

[0045] The identification information acquisition function 11a holds, in the storage area 9, the process IDs of the applications that should be running on the computer 2a. This identifies the applications that are supposed to be running on the computer 2a.
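As a rough illustration of the role of the storage area 9, the registry of process IDs could be modeled as below. The class name `StorageArea`, its methods, and the mapping from process ID to an application label are hypothetical; the specification only states that the process IDs of applications that should be running are held in the storage area.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class StorageArea:
    """Hypothetical stand-in for storage area 9: PIDs of applications that should be running."""

    expected: Dict[int, str] = field(default_factory=dict)  # process ID -> application label

    def register(self, pid: int, name: str) -> None:
        """Record that the application `name` is expected to be running as `pid`."""
        self.expected[pid] = name

    def unregister(self, pid: int) -> None:
        """Forget a process ID, for example after an intentional shutdown."""
        self.expected.pop(pid, None)

    def expected_pids(self) -> Set[int]:
        """Return the process IDs that the monitor should find on the computer."""
        return set(self.expected)
```

Because only process IDs are stored, nothing in the registry depends on what kind of application is being monitored.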

[0046] The program transfer function 11b performs processing for running, on the other computer 2b, the applications that were running on the computer 2a, and processing for running, on the computer 2a, the applications that were running on the computer 2b. The computer stop function 11c performs processing for stopping the computer 2a.

[0047] The monitoring program 12 includes a monitoring function 12a, a restart function 12b, and a program transfer instruction function 12c. The monitoring function 12a refers to the storage area 9 and checks, by calling a command provided by the OS 4, whether each application whose process ID is registered in the storage area 9 (that is, each application that should exist on the computer 2a) actually exists on the computer 2a.

[0048] When the monitoring function 12a detects that an application that was running on the computer 2a has terminated abnormally, the restart function 12b restarts that application on the computer 2a.

[0049] When the monitoring function 12a again detects abnormal termination of the application restarted by the restart function 12b, the program transfer instruction function 12c notifies the program transfer function 11b of the cluster software main body 11 on the computers 2a and 2b of an instruction to transfer the applications 6a to 6c running on the computer 2a (that is, an instruction to launch and start the applications 6a to 6c on the computer 2b).

[0050] Likewise, when the monitoring function 12a again detects abnormal termination of the application restarted by the restart function 12b, the program transfer instruction function 12c notifies the computer stop function 11c of the cluster software main body 11 on the computer 2a of an instruction to stop the computer 2a.

[0051] The operation of the cluster system 8 configured as above is described below. FIG. 2 is a flowchart showing the operation of the cluster system 8 according to this embodiment; in particular, it shows how the monitoring program 12 checks for the existence of the applications 6a to 6c and the procedure followed when disappearance of any of the applications 6a to 6c is detected.

[0052] In the cluster system 8, the monitoring function 12a of the monitoring program 12 first refers to the storage area 9 in which process IDs are registered, and retrieves the process IDs of the applications 6a to 6c to be monitored (s1).

[0053] Next, the monitoring function 12a of the monitoring program 12 calls a command provided by the OS 4 and retrieves the process IDs of the applications running on the computer 2a (s2). This yields a list of the process IDs of the applications running on the computer 2a.
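Step s2 could be realized on a POSIX system along the following lines. The specification does not name the OS command; using `ps -e -o pid=` (one process ID per line, header suppressed) and the helper names `parse_pid_list` and `running_pids` are assumptions made for this sketch.

```python
import subprocess
from typing import Set


def parse_pid_list(ps_output: str) -> Set[int]:
    """Extract process IDs from text that has one PID per line, skipping anything else."""
    pids = set()
    for line in ps_output.splitlines():
        token = line.strip()
        if token.isdigit():  # ignores blank lines and a header such as "PID"
            pids.add(int(token))
    return pids


def running_pids() -> Set[int]:
    """Ask the OS for the PIDs of all running processes (assumes a POSIX `ps`)."""
    result = subprocess.run(
        ["ps", "-e", "-o", "pid="],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_pid_list(result.stdout)
```

Separating the parsing from the command call keeps the parsing logic testable on machines where `ps` behaves differently or is unavailable.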

[0054] Next, the monitoring function 12a of the monitoring program 12 compares the process IDs obtained from the storage area 9 (in step s1) with the process IDs obtained by calling the OS 4 command (in step s2), and determines whether any of the monitored applications 6a to 6c that should be running on the computer 2a has disappeared (s3).
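The comparison in step s3 amounts to a set difference: any process ID that is registered but not reported by the OS belongs to an application that has disappeared. A minimal sketch, in which the function name `find_disappeared` is hypothetical:

```python
from typing import Iterable, Set


def find_disappeared(expected: Iterable[int], running: Iterable[int]) -> Set[int]:
    """Return the registered process IDs (s1) that are missing from the OS list (s2)."""
    return set(expected) - set(running)
```

For example, with registered IDs {101, 102, 103} and running IDs {1, 101, 103, 999}, the result is {102}; an empty result means the monitoring simply repeats.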

[0055] If the comparison shows that none of the monitored applications 6a to 6c has disappeared, the above monitoring is repeated.

[0056] If any of the monitored applications 6a to 6c has disappeared, the restart function 12b determines whether the application that disappeared has already been restarted once (s4).

[0057] If it has not yet been restarted, the restart function 12b of the monitoring program 12 restarts the application on the same computer 2a on which it was running before it disappeared (s5).

[0058] If, on the other hand, the application that disappeared had already been restarted, the program transfer instruction function 12c of the monitoring program 12 instructs the computer stop function 11c of the cluster software 10 on the computer 2a to stop the computer 2a. In addition, on the grounds that an abnormality recurring after a restart is more likely to stem from some other fault (in the OS or the computer) than from the application itself, it instructs the program transfer function 11b of the cluster software 10 on the computers 2a and 2b to start, on the computer 2b, the applications 6a to 6c that were running on the computer 2a (s6).

[0059] In response to these instructions, the computer stop function 11c of the cluster software 10 on the computer 2a stops the computer 2a, and the program transfer function 11b of the cluster software 10 on the computers 2a and 2b hands the operation of the applications 6a to 6c over to the computer 2b.
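The decision made in steps s4 to s6 (restart once locally, then stop the computer and move everything to the standby) can be sketched as a small state holder. The class name, the three callbacks, and the returned strings are hypothetical; the callbacks merely stand in for the restart function 12b, the computer stop function 11c, and the program transfer function 11b.

```python
from typing import Callable, Set


class RestartThenFailover:
    """Sketch of steps s4-s6: one local restart per application, then failover."""

    def __init__(
        self,
        restart: Callable[[str], None],
        stop_computer: Callable[[], None],
        transfer_all: Callable[[], None],
    ) -> None:
        self._restart = restart
        self._stop_computer = stop_computer
        self._transfer_all = transfer_all
        self._restarted: Set[str] = set()  # applications already restarted once

    def handle_disappearance(self, app: str) -> str:
        """React to one detected disappearance of `app`."""
        if app not in self._restarted:   # s4: not yet restarted
            self._restarted.add(app)
            self._restart(app)           # s5: restart on the same computer
            return "restarted"
        # s6: the fault is probably not in the application itself, so issue
        # the stop instruction and the transfer instruction.
        self._stop_computer()
        self._transfer_all()
        return "failed_over"
```

The sketch issues the stop instruction before the transfer instruction only because the text mentions them in that order; the specification does not state a required ordering.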

[0060] As described above, in the cluster system 8 according to this embodiment, a monitoring program 12 that monitors the applications 6a to 6c collectively is added to the cluster software 10.

[0061] Therefore, the operator does not need to add a dedicated monitoring program each time a new application is introduced, which reduces the operator's workload.

[0062] Further, in the cluster system 8 according to this embodiment, if, after an application has been restarted, the restart fails, the operation of all movable applications that were running on the computer on which the restart failed is taken over by the other, standby computer.

[0063] This prevents an application from remaining stopped and improves the reliability and availability of the system.

[0064] Although this embodiment has been described using the example in which the monitored programs are the applications 6a to 6c operating in resident form, the invention is not limited to this. For example, the same effect can be obtained by applying the same technique when the monitoring target is a daemon rather than an application, or an application that is not resident. Nor is there any limit on the number of programs monitored; any number is possible.

[0065] Likewise, although this embodiment has been described using the example in which the cluster system 8 consists of two computers 2a and 2b, the invention is not limited to this and is equally applicable to a cluster system consisting of three or more computers.

【００６６】また、本実施の形態に係るクラスタシステ
ム８は、同様の作用・機能を実現可能であれば各構成要
素の配置を変更させてもよく、また各構成要素を自由に
組み合わせてもよい。例えば、識別情報取得機能１１ａ
は、クラスタ・ソフトウェア本体１１に備えるのではな
く、監視プログラム１２に備えてもよい。In the cluster system 8 according to the present embodiment, the arrangement of the components may be changed, and the components may be freely combined, as long as the same operations and functions can be realized. For example, the identification information acquisition function 11a may be provided in the monitoring program 12 rather than in the cluster software main body 11.

【００６７】（第２の実施の形態）本実施の形態におい
ては、アプリケーション・インターフェイス（以下、
「ＡＰＩ」という）を持つアプリケーションを監視対象
とし、このアプリケーションを他系の計算機に自動的に
引き継がせるクラスタシステムについて説明する。(Second Embodiment) This embodiment describes a cluster system in which an application having an application interface (hereinafter referred to as an "API") is the monitoring target and this application is automatically taken over by a computer of the other system.

【００６８】図３は、本実施の形態に係るクラスタシス
テムの概略を例示するブロック図であり、図１、５と同
一の部分については同一の符号を付してその説明を省略
するかあるいは簡単に説明し、ここでは異なる部分につ
いてのみ詳しく説明する。FIG. 3 is a block diagram schematically illustrating a cluster system according to the present embodiment. Parts identical to those in FIGS. 1 and 5 are denoted by the same reference numerals and their description is omitted or kept brief; only the differing parts are described in detail here.

【００６９】本実施の形態に係るクラスタシステム１３
の基礎的な構成は、先で述べたクラスタシステム８と同
様であるが、アプリケーションのＡＰＩを利用して監視
を行う点が異なる。The basic configuration of the cluster system 13 according to the present embodiment is the same as that of the cluster system 8 described above, except that monitoring is performed using the APIs of the applications.

【００７０】すなわち、アプリケーション１４ａ〜１４
ｃは、それぞれに対するインターフェイスを扱うＡＰＩ
１５ａ〜１５ｃを備えている。ＡＰＩ１５ａ〜１５ｃ
は、外部のプログラムにアプリケーション１４ａ〜１４
ｃの機能を利用させるために動作する。例えば、アプリ
ケーション１４ａがデータベース管理プログラムである
場合には、ＡＰＩ１５ａはＳＱＬ命令を受け付け、処理
後の結果を返す。That is, the applications 14a to 14c are provided with APIs 15a to 15c, respectively, each handling the interface to its application. The APIs 15a to 15c operate to let external programs use the functions of the applications 14a to 14c. For example, when the application 14a is a database management program, the API 15a accepts an SQL command and returns the processed result.
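
As an illustration of the database example above, the following sketch probes an application through its API by sending a trivial SQL statement and checking the result. It uses Python's standard `sqlite3` module purely as a stand-in for the database management program; the function name `probe_database_api` and the choice of `SELECT 1` as the probe statement are assumptions, not part of the specification.

```python
import sqlite3


def probe_database_api(connection: sqlite3.Connection) -> bool:
    """Return True if a trivial SQL command is accepted and answered correctly."""
    try:
        row = connection.execute("SELECT 1").fetchone()
    except sqlite3.Error:
        # The API returned an error: treat the application as abnormal.
        return False
    return row == (1,)


conn = sqlite3.connect(":memory:")
healthy_before = probe_database_api(conn)

# Closing the connection stands in for an application that has failed.
conn.close()
healthy_after = probe_database_api(conn)
```

A probe of this kind detects an error return only; an API that never returns needs a separate time limit on the call.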

【００７１】クラスタ・ソフトウェア１６の監視プログ
ラム１７は、監視対象のアプリケーション１４ａ〜１４
ｃの各ＡＰＩ１５ａ〜１５ｃを呼び出す。その結果、何
らかのエラーがアプリケーション１４ａ〜１４ｃの持つ
ＡＰＩ１５ａ〜１５ｃのいずれかから返却されたり、あ
るいはＡＰＩ１５ａ〜１５ｃのいずれかから制御がリタ
ーンしなくなった場合、監視プログラム１７は、異常を
検出したＡＰＩを持つアプリケーションを停止させ、再
起動させる。The monitoring program 17 of the cluster software 16 calls each of the APIs 15a to 15c of the applications 14a to 14c to be monitored. If, as a result, some error is returned from any of the APIs 15a to 15c of the applications 14a to 14c, or control does not return from any of the APIs 15a to 15c, the monitoring program 17 stops and restarts the application whose API exhibited the abnormality.
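
The two abnormal outcomes named above, an error return and a call from which control never returns, can be told apart by running the API call under a time limit. The sketch below is one way to do this with Python's standard `concurrent.futures` module; the names `ProbeResult` and `call_api_with_timeout`, and the idea of a fixed timeout, are assumptions introduced for illustration and do not appear in the specification.

```python
import concurrent.futures
import time
from enum import Enum
from typing import Callable


class ProbeResult(Enum):
    NORMAL = "normal"        # the API returned normally
    ERROR = "error"          # the API returned (raised) an error
    NO_RETURN = "no_return"  # control did not come back within the time limit


def call_api_with_timeout(api_call: Callable[[], object],
                          timeout_seconds: float) -> ProbeResult:
    """Call an API in a worker thread and classify the outcome."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(api_call)
    try:
        future.result(timeout=timeout_seconds)
        return ProbeResult.NORMAL
    except concurrent.futures.TimeoutError:
        return ProbeResult.NO_RETURN
    except Exception:
        return ProbeResult.ERROR
    finally:
        # Do not block the monitor on a stalled call.
        executor.shutdown(wait=False)


def api_ok() -> str:
    return "result"


def api_fails() -> None:
    raise RuntimeError("API error")


def api_stalls() -> None:
    time.sleep(0.5)  # simulated stall, longer than the time limit used below


outcomes = [
    call_api_with_timeout(api_ok, 0.1),
    call_api_with_timeout(api_fails, 0.1),
    call_api_with_timeout(api_stalls, 0.1),
]
```

Either abnormal outcome would then lead to the stop-and-restart step; the sketch only classifies the result and leaves the restart to the caller.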

【００７２】また、この監視プログラム１７は、再起動
後のＡＰＩ呼び出しでまだ異常を検出する場合には、計
算機２ａのクラスタ・ソフトウェア本体１１に対して、
計算機２ａの停止を指示し、さらに計算機２ａ、２ｂの
クラスタ・ソフトウェア本体１１に対して、計算機２ｂ
上でのアプリケーション１４ａ〜１４ｃの起動・開始を
指示する。If the monitoring program 17 still detects an abnormality in an API call after the restart, it instructs the cluster software main body 11 of the computer 2a to stop the computer 2a, and further instructs the cluster software main bodies 11 of the computers 2a and 2b to start the applications 14a to 14c on the computer 2b.

【００７３】上記のような構成を持つクラスタシステム
１３の動作について以下に説明する。図４は、本実施の
形態に係るクラスタシステム１３の動作を示すフロー図
であり、特に監視プログラム１７によるアプリケーショ
ン１４ａ〜１４ｃの持つＡＰＩ１５ａ〜１５ｃの呼び出
しと、ＡＰＩ１５ａ〜１５ｃの呼び出しにおいて異常を
検出した際の処理手順を示している。The operation of the cluster system 13 having the above configuration is described below. FIG. 4 is a flowchart showing the operation of the cluster system 13 according to the present embodiment; in particular, it shows how the monitoring program 17 calls the APIs 15a to 15c of the applications 14a to 14c and the processing procedure performed when an abnormality is detected in those calls.

【００７４】このクラスタシステム１３においては、ま
ず、監視プログラム１７の監視機能によって記憶領域９
が参照され、この監視プログラム１７によって監視すべ
きアプリケーション１４ａ〜１４ｃの持つＡＰＩ１５ａ
〜１５ｃが呼び出される（ｔ１）。In the cluster system 13, first, the monitoring function of the monitoring program 17 refers to the storage area 9 and calls the APIs 15a to 15c of the applications 14a to 14c that the monitoring program 17 is to monitor (t1).

【００７５】ここで、このＡＰＩ１５ａ〜１５ｃの呼び
出しに対して、ＡＰＩ１５ａ〜１５ｃのいずれかからエ
ラー返却を受けたか、あるいはリターンを返却しない
（制御を戻さない）かの判定が、監視プログラム１７の
監視機能によって行われる（ｔ２）。Next, the monitoring function of the monitoring program 17 determines whether, in response to the calls to the APIs 15a to 15c, an error was returned from any of the APIs 15a to 15c or any of them failed to return (did not give control back) (t2).

【００７６】正常にリターンを受けた場合には、上記の
処理が繰り返されるが、エラー返却を受けたりリターン
を返却しない場合には、そのＡＰＩを持つアプリケーシ
ョンが再起動済みか否かが監視プログラム１７の再起動
機能により判定される（ｔ３）。If a normal return is received, the above processing is repeated. If an error is returned or no return is received, the restart function of the monitoring program 17 determines whether or not the application having that API has already been restarted (t3).

【００７７】判定の結果、未だ再起動されていない場合
には、この正常なリターンを返さないＡＰＩを持つアプ
リケーションが、監視プログラム１７の再起動機能によ
って同一の計算機２ａ上で再起動される（ｔ４）。If the determination shows that the application has not yet been restarted, the application whose API does not return normally is restarted on the same computer 2a by the restart function of the monitoring program 17 (t4).

【００７８】一方、既に再起動済みの場合には、監視プ
ログラム１７のプログラム移転指示機能により、計算機
２ａの停止が計算機２ａのクラスタ・ソフトウェア本体
１１に指示される。また、計算機２ａ上で動作していた
アプリケーション１４ａ〜１４ｃの計算機２ｂ上での起
動が、計算機２ａ、２ｂのクラスタソフトウェア本体１
１に指示される（ｔ５）。On the other hand, if the application has already been restarted, the program transfer instruction function of the monitoring program 17 instructs the cluster software main body 11 of the computer 2a to stop the computer 2a. It also instructs the cluster software main bodies 11 of the computers 2a and 2b to start, on the computer 2b, the applications 14a to 14c that were running on the computer 2a (t5).

【００７９】この指示により、計算機２ａのクラスタ・
ソフトウェア１６の持つ計算機停止機能が計算機２ａを
停止させる。また、計算機２ａ、２ｂのクラスタ・ソフ
トウェア１６の持つプログラム移転機能によりアプリケ
ーション１４ａ〜１４ｃの動作が計算機２ｂに引き継が
れる。In response to these instructions, the computer stop function of the cluster software 16 on the computer 2a stops the computer 2a. In addition, the program transfer function of the cluster software 16 on the computers 2a and 2b causes the operations of the applications 14a to 14c to be taken over by the computer 2b.
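
Steps t1 to t5 above can be sketched as a single monitoring pass. This is a simplified illustration rather than the patented implementation: the function `monitoring_cycle`, its callback parameters `restart_app` and `fail_over`, and the use of a boolean probe per application are all assumptions. Stall detection is omitted here, so an API that never returns would need a time limit around each probe.

```python
from typing import Callable, Dict, List, Set


def monitoring_cycle(
    api_probes: Dict[str, Callable[[], bool]],
    restarted: Set[str],
    restart_app: Callable[[str], None],
    fail_over: Callable[[List[str]], None],
) -> str:
    """Run one pass of steps t1 to t5 and report which action was taken."""
    for app_name, probe in api_probes.items():  # t1: call each application's API
        try:
            healthy = probe()                   # t2: normal return or error?
        except Exception:
            healthy = False
        if healthy:
            continue
        if app_name not in restarted:           # t3: already restarted once?
            restarted.add(app_name)
            restart_app(app_name)               # t4: restart on the same computer
            return f"restarted {app_name}"
        fail_over(list(api_probes))             # t5: stop computer, move all applications
        return "failed over"
    return "all normal"


events = []
restarted_apps: Set[str] = set()
probes = {"14a": lambda: True, "14b": lambda: False, "14c": lambda: True}


def record_restart(name: str) -> None:
    events.append(("restart", name))


def record_fail_over(apps: List[str]) -> None:
    events.append(("failover", apps))


first = monitoring_cycle(probes, restarted_apps, record_restart, record_fail_over)
second = monitoring_cycle(probes, restarted_apps, record_restart, record_fail_over)
```

In this run the application named "14b" keeps failing its probe, so the first pass restarts it and the second pass hands all three applications to the fail-over callback, which stands in for stopping the computer 2a and starting the applications on the computer 2b.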

【００８０】以上説明したように、本実施の形態に係る
クラスタシステム１３においては、監視プログラム１７
がアプリケーション１４ａ〜１４ｃのＡＰＩ呼び出しに
より、まとめてアプリケーション１４ａ〜１４ｃの監視
を行う。As described above, in the cluster system 13 according to the present embodiment, the monitoring program 17 monitors the applications 14a to 14c collectively by calling their APIs.

【００８１】また、アプリケーション１４ａ〜１４ｃの
ＡＰＩ呼び出しに対する異常動作、ストールを検出した
場合に当該異常の発生したアプリケーションの再起動が
実行され、さらに異常がある場合に待機系の計算機２ｂ
へアプリケーション１４ａ〜１４ｃの業務が自動的に引
き継がれる。Further, when an abnormal operation or a stall is detected in response to an API call to the applications 14a to 14c, the application in which the abnormality occurred is restarted, and if the abnormality persists, the work of the applications 14a to 14c is automatically taken over by the standby computer 2b.

【００８２】これにより、先で述べた第１の実施の形態
と同様に、オペレータの作業を軽減させ、さらにシステ
ムの信頼性、可用性が向上される。Thus, as in the first embodiment described above, the work of the operator is reduced, and the reliability and availability of the system are further improved.

【００８３】なお、本実施の形態においては、２台の計
算機２ａ、２ｂによりクラスタシステム１３が構成され
る場合を例として説明しているが、これに限定されるも
のではなく、３台以上の計算機によりクラスタシステム
が構成される場合にも同様に適用可能である。同様に、
クラスタシステム上で動作するアプリケーションの数に
も、特に制限はなくいくつであってもよい。In the present embodiment, the case where the cluster system 13 is constituted by two computers 2a and 2b has been described as an example. However, the present invention is not limited to this and is equally applicable to a case where the cluster system is constituted by three or more computers. Similarly, the number of applications operating on the cluster system is not particularly limited and may be any number.

【００８４】また、本実施の形態に係るクラスタシステ
ム１３は、同様の作用・機能を実現可能であれば各構成
要素の配置を変更させてもよく、また各構成要素を自由
に組み合わせてもよい。In the cluster system 13 according to the present embodiment, the arrangement of the components may be changed, and the components may be freely combined, as long as the same operations and functions can be realized.

【００８５】また、上記第１及び第２の実施の形態に係
るクラスタシステム８、１３におけるクラスタ・ソフト
ウェア１０、１６は、コンピュータに実行させることの
できるプログラムとして、例えば磁気ディスク（フロッ
ピー（登録商標）ディスク、ハードディスク等）、光デ
ィスク（ＣＤ−ＲＯＭ、ＤＶＤ等）、半導体メモリなど
の記憶媒体に書き込んで適用したり、通信媒体により伝
送して計算機あるいは計算機システムに適用することも
可能である。上記各機能を実現するコンピュータは、記
憶媒体に記憶されたプログラムを読み込み、プログラム
によって動作が制御されることにより、上述した処理を
実行する。The cluster software 10 and 16 in the cluster systems 8 and 13 according to the first and second embodiments can also be written, as a program executable by a computer, to a storage medium such as a magnetic disk (floppy (registered trademark) disk, hard disk, etc.), an optical disk (CD-ROM, DVD, etc.), or a semiconductor memory and applied in that form, or transmitted over a communication medium and applied to a computer or computer system. A computer that realizes each of the above functions reads the program stored in the storage medium and executes the processing described above with its operation controlled by the program.

【００８６】[0086]

【発明の効果】以上詳記したように本発明においては、
プログラムの種別に関係なく、計算機上で動作するプロ
グラムが正常か否かを監視する監視手段と、異常の発生
したプログラムの再起動に失敗した場合に、その計算機
上で動作する移動可能なプログラムを他の計算機に自動
的に移転するプログラム移転手段とを備えている。As described in detail above, the present invention comprises monitoring means for monitoring whether or not a program operating on a computer is normal regardless of the type of the program, and program transfer means for automatically transferring the movable programs operating on that computer to another computer when restarting a program in which an abnormality occurred has failed.

【００８７】このように、監視手段がプログラムの種別
に関係なく各プログラムの動作をまとめて監視すること
で、各プログラム専用の監視プログラムを導入する必要
がない。As described above, since the monitoring means collectively monitors the operation of each program regardless of the type of the program, it is not necessary to introduce a monitoring program dedicated to each program.

【００８８】したがって、新規のプログラム導入時のオ
ペレータの労力を低減させることができる。Therefore, the labor of the operator when introducing a new program can be reduced.

【００８９】また、自動的に再起動できなかったプログ
ラムを他の計算機に移転することで、プログラムの移転
が遅れることを防止することができ、これによりシステ
ムの信頼性、可用性を向上させることができる。In addition, by transferring to another computer a program that could not be restarted automatically, delays in transferring the program can be prevented, which improves the reliability and availability of the system.

【００９０】また、再起動しても異常が発生するのは、
プログラムの異常よりも計算機やＯＳの異常である可能
性が高いため、この計算機上の移動可能なプログラムを
全て移転することでさらに信頼性、可用性を向上させる
ことができる。Moreover, when an abnormality still occurs after a restart, the cause is more likely to be an abnormality in the computer or the OS than in the program itself, so transferring all movable programs on that computer further improves reliability and availability.

[Brief description of the drawings]

【図１】本発明の第１の実施の形態に係るクラスタシス
テムの概略を例示するブロック図。FIG. 1 is a block diagram schematically illustrating a cluster system according to a first embodiment of the present invention.

【図２】同実施の形態に係るクラスタシステムの動作を
示すフロー図。FIG. 2 is a flowchart showing the operation of the cluster system according to the same embodiment.

【図３】本発明の第２の実施の形態に係るクラスタシス
テムの概略を例示するブロック図。FIG. 3 is a block diagram schematically illustrating a cluster system according to a second embodiment of the present invention.

【図４】同実施の形態に係るクラスタシステムの動作を
示すフロー図。FIG. 4 is a flowchart showing the operation of the cluster system according to the same embodiment.

【図５】従来のクラスタシステムの概略を例示するブロ
ック図。FIG. 5 is a block diagram illustrating an outline of a conventional cluster system.

[Explanation of symbols]

１、８、１３…クラスタシステム
２ａ、２ｂ…計算機
３…通信回線
４…オペレーティング・システム
５、１０、１６…クラスタ・ソフトウェア
６ａ〜６ｃ、１４ａ〜１４ｃ…アプリケーション・プログラム
１５ａ〜１５ｃ…アプリケーション・インターフェイス
７ａ〜７ｃ…アプリケーション専用監視プログラム
９…記憶領域
１１…クラスタ・ソフトウェア本体
１１ａ…識別情報取得機能
１１ｂ…プログラム移転機能
１１ｃ…計算機停止機能
１２、１７…監視プログラム
１２ａ…監視機能
１２ｂ…再起動機能
１２ｃ…プログラム移転指示機能

1, 8, 13: Cluster system
2a, 2b: Computer
3: Communication line
4: Operating system
5, 10, 16: Cluster software
6a to 6c, 14a to 14c: Application program
15a to 15c: Application interface
7a to 7c: Application-specific monitoring program
9: Storage area
11: Cluster software main body
11a: Identification information acquisition function
11b: Program transfer function
11c: Computer stop function
12, 17: Monitoring program
12a: Monitoring function
12b: Restart function
12c: Program transfer instruction function

フロントページの続き (51)Int.Cl.⁷ 識別記号 ＦＩ テーマコード(参考) Ｇ０６Ｆ 15/177 ６７８ Ｇ０６Ｆ 15/177 ６７８Ｂ ６７８Ａ Ｆターム(参考) 5B034 BB02 CC01 DD02 5B042 GA11 JJ15 KK05 5B045 GG01 JJ02 JJ44 JJ45 5B098 AA10 GA02 GC01 JJ02 JJ08
Continued on the front page (51) Int.Cl.⁷ Identification symbol FI Theme code (reference) G06F 15/177 678 G06F 15/177 678B 678A F-term (reference) 5B034 BB02 CC01 DD02 5B042 GA11 JJ15 KK05 5B045 GG01 JJ02 JJ44 JJ45 5B098 AA10 GA02 GC01 JJ02 JJ08

Claims


1. A cluster system for monitoring the operation status of a program operating in an environment in which a plurality of computers are connected, comprising: identification information acquiring means for acquiring identification information of a program operating in the environment; monitoring means for monitoring whether or not the program indicated by the identification information acquired by the identification information acquiring means is normal; restarting means for, when an abnormality of the program is detected through monitoring by the monitoring means, restarting the program in which the abnormality was detected on the computer that was executing that program; and program transfer means for, when an abnormality is detected in the program restarted by the restarting means, causing the program that was operating on the computer that was executing the program in which the abnormality was detected to be executed on another computer.

2. A cluster system for monitoring the operation status of an application program that operates in an environment in which a plurality of computers are connected and that has an application interface, comprising: identification information acquiring means for acquiring identification information of the application program operating in the environment; monitoring means for calling the application interface of the application program indicated by the identification information acquired by the identification information acquiring means and monitoring whether or not the application program is executing normally; restarting means for, when an abnormality of the application program is detected through monitoring by the monitoring means, restarting the application program in which the abnormality was detected on the computer that was executing that application program; and program transfer means for, when an abnormality is detected in the application program restarted by the restarting means, causing the program that was operating on the computer that was executing the application program in which the abnormality was detected to be executed on another computer.

3. A computer-readable storage medium storing a program for causing a computer to realize: an identification information acquisition function of acquiring identification information of a program operating in an environment in which a plurality of computers are connected; a monitoring function of monitoring whether or not the program indicated by the identification information acquired by the identification information acquisition function is normal; a restart function of, when an abnormality of the program is detected through monitoring by the monitoring function, restarting the program in which the abnormality was detected on the computer that was executing that program; and a program transfer function of, when an abnormality is detected in the program restarted by the restart function, causing the program that was operating on the computer that was executing the program in which the abnormality was detected to be executed on another computer.

4. A computer-readable storage medium storing a program for causing a computer to realize: an identification information acquisition function of acquiring identification information of an application program that operates in an environment in which a plurality of computers are connected and that has an application interface; a monitoring function of calling the application interface of the application program indicated by the identification information acquired by the identification information acquisition function and monitoring whether or not the application program is executing normally; a restart function of, when an abnormality of the application program is detected through monitoring by the monitoring function, restarting the application program in which the abnormality was detected on the computer that was executing that application program; and a program transfer function of, when an abnormality is detected in the application program restarted by the restart function, causing the program that was operating on the computer that was executing the application program in which the abnormality was detected to be executed on another computer.