JPWO2015104841A1

JPWO2015104841A1 - MULTIPLEX SYSTEM AND MULTIPLEX SYSTEM MANAGEMENT METHOD

Info

Publication number: JPWO2015104841A1
Application number: JP2015556697A
Authority: JP
Inventors: 和彦小俣; 信孝岡本; 貴文秦泉寺
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2014-01-10
Filing date: 2014-01-10
Publication date: 2017-03-23
Anticipated expiration: 2034-01-10
Also published as: AU2014376751A1; EP3093766A1; WO2015104841A1; AU2014376751B2; US10055004B2; JP6130520B2; US20160349830A1; SG11201602367WA; EP3093766A4; CN105579973A

Abstract

Problem: To multiplex the failure detection configuration in a multiplex system so that failures are detected accurately and the necessary system switchover is executed reliably. Solution: In a multiplex system 10, the power supply mechanism 200 of each of the multiplexed computers 150, 180 includes an arithmetic unit 204. The arithmetic unit 204 monitors the writing of predetermined information to the storage device 201 of that power supply mechanism 200 by another device 300 or by another mechanism 112 of the computer 150, 180 concerned. When the writing does not conform to a predetermined rule, the arithmetic unit 204 stops or resets the power supply device 230 and, after executing that operation, instructs the other of the computers 150, 180 to perform a recovery operation.

Description

The present invention relates to a multiplex system and a multiplex system management method.

Mission-critical systems that cannot tolerate casual downtime, such as the core system of a financial institution, are generally built as a cluster configuration, that is, as a multiplex system. In such a multiplex system, the active and standby devices monitor each other, and the standby system is switched to active when an abnormality is detected in the active system.

Techniques such as the following have been proposed for monitoring and operating such a multiplex system: a client terminal accesses a duplexed network management system to acquire a monitoring program that monitors system switchover, starts the acquired monitoring program, and uses it to access the duplexed network management system periodically, detecting from the responses that the network management system has switched over (see Patent Document 1).

Patent Document 1: Japanese Patent Laid-Open No. 2005-4404

At present, systems are indeed multiplexed into an active system and a standby system, but the mechanism responsible for abnormality detection and the resulting system switchover is not itself multiplexed. Consequently, if a failure occurs in that mechanism, abnormality detection in the multiplex system stops, no trigger for switchover ever arises, and the service may simply halt. In other words, the abnormality detection and switchover mechanism becomes a single point of failure, which risks fundamentally undermining the benefit of multiplexing the system into active and standby systems.

Accordingly, an object of the present invention is to provide a technique that multiplexes the failure detection configuration in a multiplex system so that failures are detected accurately and the necessary system switchover can be executed reliably.

The multiplex system of the present invention that solves the above problem is characterized in that the power supply mechanism of each multiplexed computer includes an arithmetic unit that monitors the writing of predetermined information to the storage device of that power supply mechanism by another device or by another mechanism of the computer concerned, stops or resets the power supply when the writing does not conform to a predetermined rule, and, after executing that operation, instructs the other of the computers to perform a recovery operation. Each of these computers is assumed to also have the mutual monitoring function that clustering software conventionally provides in a multiplex system (the same applies hereinafter).

The multiplex system management method of the present invention is characterized in that the power supply mechanism of each multiplexed computer monitors the writing of predetermined information to the storage device of that power supply mechanism by another device or by another mechanism of the computer concerned, stops or resets the power supply when the writing does not conform to a predetermined rule, and, after executing that operation, instructs the other of the computers to perform a recovery operation.

According to the present invention, a monitoring function in the power supply mechanism is provided in addition to the mutual monitoring function that clustering software conventionally provides in a multiplex system. The failure detection configuration in the multiplex system is thereby multiplexed, so that failures are detected accurately and the necessary system switchover can be executed reliably.

FIG. 1 is a diagram showing a network configuration example including the multiplex system of the first embodiment.
FIG. 2 is a diagram showing a configuration example of the server of the first embodiment.
FIG. 3 is a diagram showing a configuration example of the power supply mechanism of the first embodiment.
FIG. 4 is a diagram showing a configuration example of the monitoring table of the first embodiment.
FIG. 5 is a flowchart showing processing procedure example 1 of the multiplex system management method in the first embodiment.
FIG. 6 is a flowchart showing processing procedure example 2 of the multiplex system management method in the first embodiment.
FIG. 7 is a diagram showing a network configuration example including the multiplex system of the second embodiment.
FIG. 8 is a diagram showing a configuration example of the monitoring computer of the second embodiment.
FIG. 9 is a diagram showing a configuration example of the server of the second embodiment.
FIG. 10 is a flowchart showing processing procedure example 1 of the multiplex system management method in the second embodiment.
FIG. 11 is a flowchart showing processing procedure example 2 of the multiplex system management method in the second embodiment.

Embodiments of the present invention are described in detail below with reference to the drawings. FIG. 1 is a diagram showing a network configuration example including the multiplex system 10 of the first embodiment. The multiplex system 10 shown in FIG. 1 is a computer system that multiplexes the failure detection configuration so that failures are detected accurately and the necessary system switchover can be executed reliably.

The multiplex system 10 assumed here is, as one example, a core system operated by a financial institution. The multiplex system 10 is of course not limited to systems in financial institutions; various server systems in other industries that are clustered to form a multiplex system can also be assumed.

The multiplex system 10 includes an active server 150, which executes business processing in normal operation, and a standby server 180, which takes over from the active server 150 when an abnormality occurs in it. The active server 150 and the standby server 180 are communicably connected via a network 20 and form a multiplex system by means of existing clustering software. Each of the active server 150 and the standby server 180 is accompanied by a power supply mechanism 200 that supplies its operating power. The power supply mechanism 200 is connected to the server it powers (the active server 150 or the standby server 180) through connectors that supply power at a predetermined voltage and carry communication, but it is hardware configured separately from these server devices.

Next, the hardware configuration of the active server 150 and the standby server 180 that constitute the multiplex system 10 is described. Hereinafter, where no distinction is needed, the active server 150 and the standby server 180 are collectively referred to as the server 100. FIG. 2 is a diagram showing a configuration example of the server 100 of the first embodiment.

The server 100 constituting the multiplex system 10 includes: a storage device 101 composed of an appropriate nonvolatile storage device such as a hard disk drive; a memory 104 composed of a volatile storage device such as RAM; an arithmetic unit 105 such as a CPU, which boots the OS (Operating System) 102 held in the storage device 101, reads and executes the appropriate programs 103, performs overall control of the device itself, and carries out various determination, computation, and control processing; a communication device 106, which connects to the network 20 and handles communication with other devices; and a read drive 107 for portable media.

The storage device 101 stores the OS 102 and the programs 103 that implement the functions required of the server 100 constituting the multiplex system 10. The programs 103 include a business program 110, a cluster monitoring program 111, and a survival notification program 112. The business program 110 executes processing corresponding to, for example, a predetermined business operation of a financial institution. The cluster monitoring program 111 is an existing program, included in existing clustering software, that performs mutual abnormality monitoring between the active and standby servers. The survival notification program 112 executes the writing of predetermined information to the storage device 201 of the power supply mechanism 200.

The cluster monitoring function is implemented when the arithmetic unit 105 of the server 100 executes the cluster monitoring program 111. The cluster monitoring function resides in each of the active server 150 and the standby server 180 and performs conventional mutual alive monitoring between the servers, such as heartbeat monitoring.

The survival notification function is implemented when the arithmetic unit 105 of the server 100 executes the survival notification program 112. The survival notification function resides in each of the active server 150 and the standby server 180. As the predetermined information it obtains, for example, the current time (a time stamp) at fixed intervals from the clock function of the OS 102 or the like, and transmits it to the power supply mechanism 200 via the internal signal line 30.
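As a minimal sketch of the survival notification function described above, the following Python fragment sends a time stamp at a fixed interval. The function name `run_survival_notifier`, the `send` callback (standing in for transmission over the internal signal line 30), and the 5-second interval are assumptions made for illustration; the specification does not define them.

```python
import time
from typing import Callable, List


def run_survival_notifier(send: Callable[[float], None],
                          interval_s: float,
                          count: int,
                          clock: Callable[[], float] = time.time,
                          sleep: Callable[[float], None] = time.sleep) -> None:
    """Send `count` time stamps, one every `interval_s` seconds."""
    for _ in range(count):
        send(clock())      # time stamp taken from the (OS) clock
        sleep(interval_s)  # wait until the next notification slot


# Usage with a fake clock so the example finishes instantly.
sent: List[float] = []
fake_now = [0.0]


def fake_sleep(seconds: float) -> None:
    fake_now[0] += seconds


run_survival_notifier(sent.append, 5.0, 3,
                      clock=lambda: fake_now[0], sleep=fake_sleep)
```

Injecting `clock` and `sleep` is only a convenience for testing; a resident process would use the defaults and loop indefinitely.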

It is preferable that the OS 102 of the server 100 or a predetermined program monitor the operating status of the survival notification function provided by the survival notification program 112 and, on detecting a fault such as a slowdown or stop in that function, re-execute the survival notification program 112 within a predetermined time to restart the function. Operating in this way restores a faulty survival notification function promptly so that the writing process resumes promptly. This reliably avoids a situation in which system switchover is triggered by a failure detection that stems solely from a fault in the writing function, even though the primary functions of the server 100 in the multiplex system 10 (such as the functions provided by the OS 102 and the business program 110) have not themselves failed.

Next, the hardware configuration of the power supply mechanism 200, which accompanies each server 100 (that is, each of the active server 150 and the standby server 180) and supplies its operating power, is as follows. FIG. 3 is a diagram showing a configuration example of the power supply mechanism 200 of the first embodiment.

The power supply mechanism 200 includes a power supply device 230, composed of the transformer, fuses, cooling fan, heat sink, and other parts generally found in a computer power supply unit, and a power supply control device 240, which performs on/off control of the power supply device 230.

The power supply device 230 is connected by predetermined cables to the connector on the motherboard of the server 100 and to the connectors of the storage device 101 and the portable-media read drive 107, and supplies them with direct current at predetermined voltages. One line in these cables always carries a weak standby current and serves as a signal line that conveys control signals from the chipset side of the powered server 100, such as a WOL (Wake-up On LAN) signal, to the power supply control device 240. In this embodiment, this line is referred to as the internal signal line 30.

The power supply control device 240 is implemented as a BMC (Baseboard Management Controller), a system management controller equipped with its own processor. A BMC generally has the function of constantly monitoring various conditions, such as the supply voltage of the power supply device 230, the rotation speed of the cooling fan, and the temperature of various parts including the CPU (arithmetic unit 105) of the server 100, and notifying the OS 102 of them. Even when the server itself is powered off, the BMC, that is, the power supply control device 240, continues to receive power and keeps operating as long as an appropriate power source such as commercial power is connected to the power supply device 230. In other words, the power supply control device 240 is independent of the OS 102 and other higher-level software on the server 100 that it powers.

The power supply control device 240, being a BMC, includes: a storage device 201 composed of an appropriate nonvolatile storage device such as ROM; a memory 203 composed of a volatile storage device such as RAM; an arithmetic unit 204, a processor that reads the programs 202 held in the storage device 201 into the memory 203 and executes them, performs overall control of the device itself, and carries out various determination, computation, and control processing; and a communication device 205, which connects to the chipset of the powered server 100 via the internal signal line 30 and communicates with the OS 102 of the server 100.

The storage device 201 of the power supply mechanism 200 stores the programs 202 that implement the functions required of the power supply mechanism 200, together with a monitoring table 225. The programs 202 include a table monitoring program 210 and a power supply control program 211. The table monitoring program 210 writes the predetermined information (for example, a time stamp) transmitted from the survival notification function of the server 100 via the internal signal line 30 into the monitoring table 225, resets a predetermined timer each time it performs this write, and repeatedly determines whether the information in the monitoring table 225 has been updated within a fixed time. The power supply control program 211, on receiving a notification from the table monitoring program 210, powers off or resets the power supply device 230 and, after executing that operation, instructs the standby server 180 to perform a recovery operation. The power-off and reset function of the power supply control program 211 is the same as the power control function of a general BMC.

The table monitoring function is implemented when the arithmetic unit 204 of the power supply control device 240 executes the table monitoring program 210. Likewise, the power supply control function is implemented when the arithmetic unit 204 executes the power supply control program 211.

The table monitoring function resides in the power supply mechanism 200. When, for example, the survival notification function of the server 100 transmits time stamps, the table monitoring function writes each time stamp into the monitoring table 225 as it arrives via the internal signal line 30, continually updating the table. It also starts a timer at each time stamp update to monitor whether the next update of the time stamp in the monitoring table 225 occurs within a fixed time, and makes its determination under the rule that time stamp updates must continue to occur within that fixed time. On detecting that a time stamp update has not occurred within the fixed time, the table monitoring function recognizes that writes from the server 100, that is, from the OS 102 side, have stalled, and instructs the power supply control function to power off or reset the power supply device 230.
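The rule described above (each write restarts a timer, and a missed deadline triggers the power supply control function) can be sketched as follows. This is an illustrative model only: the class `TableMonitor`, its `poll` method, the dictionary standing in for the monitoring table 225, and the 10-second timeout are all assumptions, and real BMC firmware would not be written this way.

```python
from typing import Callable, Optional


class TableMonitor:
    """Toy model of the table monitoring function (hypothetical names)."""

    def __init__(self, timeout_s: float, on_timeout: Callable[[], None]):
        self.timeout_s = timeout_s
        self.on_timeout = on_timeout        # stands in for the power supply control function
        self.table = {"timestamp": None}    # stands in for monitoring table 225
        self.deadline: Optional[float] = None
        self.tripped = False

    def write(self, timestamp, now: float) -> None:
        """Store the received time stamp and restart the timer."""
        self.table["timestamp"] = timestamp
        self.deadline = now + self.timeout_s

    def poll(self, now: float) -> None:
        """Fire on_timeout once if no update arrived before the deadline."""
        if self.tripped or self.deadline is None:
            return
        if now > self.deadline:
            self.tripped = True
            self.on_timeout()


events = []
monitor = TableMonitor(10.0, lambda: events.append("power_off_or_reset"))
monitor.write("t0", now=0.0)
monitor.poll(now=5.0)      # within the deadline: nothing happens
monitor.write("t1", now=8.0)
monitor.poll(now=17.0)     # deadline moved to 18.0: still nothing
monitor.poll(now=18.5)     # deadline missed: the instruction is issued
monitor.poll(now=30.0)     # already tripped: not issued twice
```

Passing `now` explicitly keeps the sketch deterministic; on real hardware the timer would typically be a hardware or firmware timer rather than polled wall-clock time.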

A time stamp, as shown in FIG. 4, has been given as an example of the information written into the monitoring table 225 by the writing process. Various other kinds of information that follow an appropriate rule can also be adopted, such as a specific fixed value that does not change from one write to the next, or a numerical value that is incremented at each write.

When a specific fixed value that does not change from one write to the next (for example, 1) is received from the survival notification function and written into the monitoring table 225, the table monitoring function overwrites it with another predetermined value (for example, 0) within a predetermined time after each write of the fixed value. The table monitoring function starts a timer each time it performs this overwrite and makes its determination under the rule that the next overwrite must be performed within a fixed time. On detecting that the overwrite has not occurred within the fixed time, it recognizes that writes from the server 100, that is, from the OS 102 side, have stalled, and instructs the power supply control function to power off or reset the power supply device 230.
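One possible reading of the fixed-value variant is sketched below: the notifier sets a cell to 1, and the monitor clears it to 0 and restarts its timer; if the cell is not set again within the timeout, a fault is reported. The class `FlagMonitor`, the constants `ALIVE` and `CLEARED`, and the exact moment at which the timer is restarted are assumptions of this sketch, not details given in the specification.

```python
ALIVE, CLEARED = 1, 0   # example values taken from the text (1 and 0)


class FlagMonitor:
    """Toy model of the fixed-value rule (hypothetical names)."""

    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.cell = CLEARED        # the monitored table entry
        self.timer_start = 0.0     # time of the most recent overwrite

    def notify(self) -> None:
        """Called when the fixed value arrives from the server side."""
        self.cell = ALIVE

    def check(self, now: float) -> bool:
        """Return True while healthy, False once the timeout has elapsed."""
        if self.cell == ALIVE:
            self.cell = CLEARED    # overwrite with the other value
            self.timer_start = now # restart the timer at each overwrite
            return True
        return (now - self.timer_start) <= self.timeout_s


flag = FlagMonitor(timeout_s=10.0)
flag.notify()
results = [flag.check(1.0), flag.check(6.0)]   # overwrite at t=1, then still in time
flag.notify()
results += [flag.check(9.0), flag.check(19.0), flag.check(19.5)]
```

In this trace the last check happens 10.5 seconds after the most recent overwrite, so it is the only one that reports a fault.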

When a numerical value that is incremented at each write is written into the monitoring table 225, the table monitoring function starts a timer each time it writes a value received from the survival notification function and makes its determination under the rule that a further-incremented value must be written within a fixed time. On detecting that such a write has not occurred within the fixed time, it recognizes that writes from the server 100, that is, from the OS 102 side, have stalled, and instructs the power supply control function to power off or reset the power supply device 230.
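The incrementing-value variant can be sketched in the same style. Here the sketch assumes that a value which is not larger than the previous one is ignored and does not restart the timer, so a sender that keeps repeating a stale value is eventually treated as stalled; the specification does not state how such values are handled. `CounterMonitor` and its methods are hypothetical names.

```python
from typing import Optional


class CounterMonitor:
    """Toy model of the incrementing-value rule (hypothetical names)."""

    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.last_value: Optional[int] = None
        self.deadline: Optional[float] = None

    def write(self, value: int, now: float) -> bool:
        """Accept only a value larger than the previous one; restart the timer."""
        if self.last_value is not None and value <= self.last_value:
            return False                       # stale value: timer keeps running
        self.last_value = value
        self.deadline = now + self.timeout_s
        return True

    def expired(self, now: float) -> bool:
        """True once no incremented value arrived before the deadline."""
        return self.deadline is not None and now > self.deadline


counter = CounterMonitor(timeout_s=10.0)
accepted = [counter.write(1, 0.0), counter.write(2, 5.0), counter.write(2, 12.0)]
status = [counter.expired(14.0), counter.expired(16.0)]
```

The third write repeats the value 2 and is rejected, so the deadline stays at 15.0 and the check at 16.0 reports expiry.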

The combination of the table monitoring function provided by the table monitoring program 210 and the monitoring table 225 can also be regarded as a so-called watchdog timer.

The actual procedure of the multiplex system management method in this embodiment is described below with reference to the drawings. The operations corresponding to the multiplex system management method described below are realized by programs executed respectively by the server 100 and the power supply mechanism 200 that constitute the multiplex system 10. Each program consists of code for performing the operations described below.

FIG. 5 is a flowchart showing processing procedure example 1 of the multiplex system management method in this embodiment. Assume that the active server 150 in the multiplex system 10 is continuously executing predetermined business processing of a financial institution by means of the business program 110, and that the active server 150 and the standby server 180 are monitoring each other with conventional heartbeat-based alive monitoring through the cluster monitoring function provided by the cluster monitoring program 111. Assume also that, in parallel with this conventional alive monitoring, the survival notification function resident in each of the active server 150 and the standby server 180 is transmitting a time stamp to the power supply mechanism 200 via the internal signal line 30 at fixed intervals.

Under these conditions, the active server 150 transmits the time stamp value issued by its resident survival notification function from the connector on its motherboard, via the internal signal line 30, to the power supply control device 240 of the power supply mechanism 200 (s100).

The power supply control device 240 of the power supply mechanism 200 writes the time stamp transmitted from the survival notification function of the active server 150 into the monitoring table 225 in the storage device 201 (s101). Through the table monitoring function provided by the table monitoring program 210, it also detects the timing of the time stamp write to the monitoring table 225 and, in response, resets a timer that expires after a predetermined time and starts measuring elapsed time (s102).

With the timer running, the power supply control device 240 uses the table monitoring function to watch, during the predetermined time until the timer expires, for the next write of a time stamp to the monitoring table 225, that is, for an update event (s103). If a new time stamp arrives from the survival notification function before the timer expires and the time stamp in the monitoring table 225 is updated (s104: OK), the power supply control device 240 returns the process to step s102 in response to the update, resets the timer, and starts measuring elapsed time again.

If, on the other hand, no new time stamp is received from the survival notification function before the timer expires and the time stamp in the monitoring table 225 is not updated (s104: NG), the table monitoring function of the power supply control device 240 recognizes that some failure preventing time stamps from being issued has occurred in the active server 150, that is, in the OS 102, and instructs the power supply control function provided by the power supply control program 211 to power off or reset the power supply device 230 (s105). On receiving this instruction, the power supply control function powers off or resets the power supply device 230 (s106). This power-off or reset operation is the same as a conventional power control operation.

Through the power supply control function, the power supply control device 240 detects that the power-off or reset of the power supply device 230 has completed and, via the internal signal line 30 and the network 20, instructs the standby server 180 to perform a recovery operation (s107). On receiving this instruction, the standby server 180 follows the conventional procedure, promptly takes over business processing from the active server 150, and begins operating as the new active system.
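The flow of steps s101 to s107 can be traced with a small simulation. The sketch below is a simplification under stated assumptions: time is modeled as plain numbers, the power-off or reset is assumed to complete instantly at the deadline, and the function name `simulate_flow` and the log labels are invented for illustration.

```python
from typing import List, Tuple


def simulate_flow(arrival_times: List[int], timeout_s: int) -> List[Tuple[str, int]]:
    """Trace steps s101-s107 for time stamps arriving at the given times.

    Arrivals that come after the running timer has expired are never
    processed, because the power supply has already been stopped or reset.
    """
    log: List[Tuple[str, int]] = []
    deadline = None
    for t in arrival_times:
        if deadline is not None and t > deadline:
            break                                  # s104: NG, timer expired first
        log.append(("s101_write_timestamp", t))
        deadline = t + timeout_s
        log.append(("s102_reset_timer", t))        # s103/s104: OK path loops here
    if deadline is not None:
        log.append(("s105_instruct_power_off_or_reset", deadline))
        log.append(("s106_power_off_or_reset", deadline))
        log.append(("s107_instruct_recovery", deadline))
    return log


# Time stamps arrive at t = 0, 5, 10, then the server stalls until t = 30.
trace = simulate_flow([0, 5, 10, 30], timeout_s=8)
```

With these inputs the last accepted time stamp arrives at t = 10, so the timer expires at t = 18 and the recovery instruction is logged at that time; the arrival at t = 30 is never written.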

Separately from the operation flow described above, which monitors time stamp updates in the monitoring table 225, when the alive monitoring performed by the conventional cluster monitoring function detects an abnormality, an operation flow is likewise executed in which, as in step s107, the standby server 180 takes over from the active server 150 and operates as the new active system. This processing is the same as in the prior art, so its description is omitted. In either case, whichever operation flow detects the abnormality first carries out the processing that leads to the recovery operation by the standby server 180.
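The statement that whichever flow detects the abnormality first leads to recovery can be illustrated with a small coordinator that starts recovery once and ignores later reports. This is an assumption about how duplicate detections might be handled; the specification does not describe such a component, and `FailoverCoordinator` and the source labels are hypothetical.

```python
import threading
from typing import Callable, List, Optional


class FailoverCoordinator:
    """Starts recovery for the first reported abnormality only (hypothetical)."""

    def __init__(self, start_recovery: Callable[[str], None]):
        self._lock = threading.Lock()
        self._start_recovery = start_recovery
        self.triggered_by: Optional[str] = None

    def report_abnormality(self, source: str) -> bool:
        """Return True if this report triggered recovery, False if it was late."""
        with self._lock:
            if self.triggered_by is not None:
                return False
            self.triggered_by = source
        self._start_recovery(source)
        return True


started: List[str] = []
coordinator = FailoverCoordinator(started.append)
first = coordinator.report_abnormality("power_supply_watchdog")
second = coordinator.report_abnormality("cluster_heartbeat")
```

The lock only guards the first-wins decision, so the two detection paths may report from different threads without starting recovery twice.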

If the core functions of the active server 150 and the standby server 180, such as the OS 102, and the hardware that implements them are sound, and only the survival notification function of the survival notification program 112 malfunctions, then unless some countermeasure is taken, time stamps are not sent and the time stamp in the monitoring table 225 is not updated, so a needless recovery operation would be executed.

Therefore, as shown in the flow of FIG. 6, the OS 102 or a predetermined program of the active server 150 and the standby server 180 constantly monitors the operating status of the survival notification function of the survival notification program 112 (s200). If it detects a malfunction such as a slowdown or stop in the survival notification function (s201: Y), it re-executes the survival notification program 112 within a predetermined time to restart the function (s202). This series of processes is assumed to run in parallel with steps s100 to s107 described above.

Operating in this way promptly restores a malfunctioning survival notification function, so that time stamp issuance and time stamp updates in the monitoring table 225 resume quickly.
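A single pass of the monitoring and restart flow (s200 to s202) might look like the sketch below. The stall criterion (time since the last sent time stamp exceeding `stall_limit`) is an assumed way of detecting a "slowdown or stop"; the specification does not define the detection rule, and the function and parameter names are hypothetical.

```python
from typing import Callable


def supervise_once(
    last_sent: float,
    now: float,
    stall_limit: float,
    restart: Callable[[], None],
) -> bool:
    """One monitoring pass: restart the notifier if it looks stalled.

    Returns True when a restart was requested.
    """
    stalled = (now - last_sent) > stall_limit   # s201: malfunction detected?
    if stalled:
        restart()                               # s202: re-execute the notification program
    return stalled
```

For the restart to prevent a needless failover, `stall_limit` plus the restart time would presumably need to be shorter than the timer period used by the power supply control device 240; that relationship is an inference from the text, not something it states.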

Next, a second embodiment is described in which, unlike the first embodiment and as illustrated in FIG. 7, the survival notification function is implemented by a monitoring computer 300 that can communicate with the active server 150 and the standby server 180 via the network 20.

In this case, the hardware configuration of the monitoring computer 300 is as follows. FIG. 8 is a diagram showing a configuration example of the monitoring computer 300 of the second embodiment. The monitoring computer 300 comprises: a storage device 301 composed of a suitable nonvolatile storage device such as a hard disk drive; a memory 304 composed of a volatile storage device such as RAM; an arithmetic unit 305 such as a CPU, which boots the OS (Operating System) 302 held in the storage device 301, reads and executes programs 303 as appropriate to exercise overall control of the computer itself, and performs various determination, computation, and control processing; and a communication device 306 that connects to the network 20 and handles communication with the servers 100.

The programs 303 include a survival notification program 310. The survival notification program 310 repeatedly transmits, at predetermined intervals, requests to write a time stamp to the monitoring table 225 in the power supply mechanism 200 to the active server 150 and the standby server 180.

In this case, the survival notification function is implemented by the arithmetic unit 305 of the monitoring computer 300 executing the survival notification program 310. The survival notification function is resident in the monitoring computer 300. At fixed intervals it obtains predetermined information, for example current time information (a time stamp), from the clock function of the OS 302 or a similar source, and transmits a write request containing it to the active server 150 and the standby server 180 via the network 20.
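One round of the survival notification function on the monitoring computer 300 can be sketched as follows. The request layout (a dictionary with `target` and `timestamp` keys), the server labels, and the `send` callback are assumptions for illustration; the specification does not define a message format or transport.

```python
import time
from typing import Callable, Dict, Iterable


def build_write_request(now: float) -> Dict[str, object]:
    """Build a hypothetical write request carrying the current time stamp."""
    return {"target": "monitoring_table_225", "timestamp": now}


def send_round(
    servers: Iterable[str],
    send: Callable[[str, Dict[str, object]], None],
    clock: Callable[[], float] = time.time,
) -> Dict[str, object]:
    """Read the clock once and send the same write request to every server."""
    request = build_write_request(clock())
    for server in servers:
        send(server, request)
    return request
```

A caller would invoke `send_round` at the fixed interval, for example from a loop with `time.sleep`, passing whatever network send routine is actually in use.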

In this system configuration, each time the active server 150 and the standby server 180 receive a write request from the monitoring computer 300, they forward it to the power supply mechanism 200. This forwarding is executed by the transfer program 113 in the active server 150 and the standby server 180. The hardware configuration of the active server 150 and the standby server 180, that is, of the server 100, in the second embodiment is as shown in FIG. 9. It is the same as in the first embodiment except that the server holds the transfer program 113 and does not hold the survival notification program 112.

As in the first embodiment, it is preferable that the OS 302 or a predetermined program of the monitoring computer 300 monitors the operating status of the survival notification function of the survival notification program 310 and, on detecting a malfunction such as a slowdown or stop, re-executes the survival notification program 310 within a predetermined time to restart the function. Operating in this way promptly restores a malfunctioning survival notification function, so that write processing resumes quickly.

The hardware configuration of the power supply mechanism 200 in the second embodiment is the same as in the first embodiment, so its description is omitted.

Next, the multiplex system management method of the second embodiment is described. FIG. 10 is a flowchart showing processing procedure example 1 of the multiplex system management method in the second embodiment. Assume that the active server 150 in the multiplex system 10 is continuously executing predetermined business processing of a financial institution with the business program 110, and that the active server 150 and the standby server 180 are monitoring each other's liveness by conventional heartbeat through the cluster monitoring function of the cluster monitoring program 111. Assume also that, in parallel with this conventional liveness monitoring, the survival notification function resident in the monitoring computer 300 is transmitting the write request to the active server 150 via the network 20 at fixed intervals.

Under these conditions, the monitoring computer 300 transmits the time stamp value issued by its resident survival notification function to the active server 150 via the network 20 using the communication device 306 (s300).

The active server 150 receives the write request from the monitoring computer 300 and, using the transfer function of the transfer program 113, forwards it from a connector on the motherboard of the active server 150 via the internal signal line 30 to the power supply control device 240 in the power supply mechanism 200 (s301).

The power supply control device 240 in the power supply mechanism 200 receives the write request sent by the transfer function of the active server 150 and writes the time stamp it indicates to the monitoring table 225 in the storage device 201 (s302). At the same time, the table monitoring function of the table monitoring program 210 detects the timing of the time stamp write to the monitoring table 225 and, in response, resets a timer that expires after a predetermined time and starts time measurement (s303).

With the timer running, the power supply control device 240 uses the table monitoring function to watch for the next time stamp write to the monitoring table 225, that is, an update event, during the predetermined time until the timer expires (s304). If a new time stamp arrives from the transfer function during this period and the time stamp in the monitoring table 225 is updated (s305: OK), the power supply control device 240 returns processing to step s303, resets the timer, and starts time measurement again.

On the other hand, if no new time stamp is received from the transfer function before the timer expires, so that the time stamp in the monitoring table 225 is not updated (s305: NG), the table monitoring function of the power supply control device 240 recognizes that some failure has occurred in the active server 150, that is, in the OS 102, that prevents it from forwarding the write request from the monitoring computer 300. It then instructs the power supply control function implemented by the power supply control program 211 to power off or reset the power supply device 230 (s306). On receiving this instruction, the power supply control function powers off or resets the power supply device 230 (s307). This power-off or reset operation is the same as a conventional power supply control operation.

Using the power supply control function, the power supply control device 240 detects that the power-off or reset of the power supply device 230 has completed, and sends a recovery operation instruction to the standby server 180 via the internal signal line 30 and the network 20 (s308). On receiving this instruction, the standby server 180 promptly takes over business processing from the active server 150 by the conventional procedure and starts operating as the new active system.
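The second-embodiment flow (s300 to s308) can be traced with a small discrete-time simulation. Everything here is a simplifying assumption: time advances in integer ticks, a first write at tick 0 is taken to have started the timer, and the server's failure is modeled as the forwarding step ceasing from a given tick onward. The function and parameter names are hypothetical.

```python
from typing import Optional


def simulate_second_embodiment(
    send_interval: int,
    timeout: int,
    forward_fails_at: int,
    horizon: int,
) -> Optional[int]:
    """Return the tick at which the power controller would trigger failover, or None."""
    last_write = 0  # assume a write at tick 0 started the timer (s302 to s303)
    for tick in range(1, horizon + 1):
        is_send_tick = tick % send_interval == 0          # s300: monitor sends a request
        if is_send_tick and tick < forward_fails_at:      # s301: server still forwards
            last_write = tick                             # s302 to s303: write, reset timer
        if tick - last_write >= timeout:                  # s305: NG, timer expired
            return tick                                   # s306 to s308 would follow
    return None
```

With `send_interval=2`, `timeout=5`, and forwarding failing from tick 7, the last successful write is at tick 6, so the timer expires at tick 11. Detection latency is therefore bounded by the timer period plus at most one send interval.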

If the core functions of the active server 150 and the standby server 180, such as the OS 102, and the hardware that implements them are sound, and only the transfer function of the transfer program 113 malfunctions, then unless some countermeasure is taken, write requests containing the time stamp are not forwarded and the time stamp in the monitoring table 225 is not updated, so a needless recovery operation would be executed.

Therefore, as shown in the flow of FIG. 11, the OS 302 or a predetermined program of the monitoring computer 300 constantly monitors the operating status of the survival notification function of the survival notification program 310 (s400). If it detects a malfunction such as a slowdown or stop in the survival notification function (s401: Y), it re-executes the survival notification program 310 within a predetermined time to restart the function (s402). This series of processes is assumed to run in parallel with steps s300 to s308 described above.

Operating in this way promptly restores a malfunctioning survival notification function, so that time stamp issuance, transmission of write requests containing the time stamp, and the resulting time stamp updates in the monitoring table 225 resume quickly.

The best mode for carrying out the present invention has been described specifically above, but the present invention is not limited to it, and various modifications are possible without departing from its gist.

According to the present embodiment, adding a monitoring function in the power supply mechanism to the mutual monitoring function of the clustering software conventionally provided in a multiplex system multiplexes the failure detection configuration of the multiplex system, so that failures are detected accurately and the necessary system switching operation can be executed reliably.

The description in this specification makes at least the following clear. In the multiplex system of the present embodiment, each of the multiplexed computers may comprise an arithmetic unit that repeatedly executes the write processing on the storage device of the power supply mechanism at predetermined intervals.

With this configuration, write processing to the storage device of the power supply mechanism is executed at a fixed frequency, and an interruption of this write processing or a similar event can be detected quickly as an event that does not conform to the predetermined rule.

In the multiplex system of the present embodiment, the arithmetic unit of each multiplexed computer may, as the write processing, write current time information to the storage device of the power supply mechanism at predetermined intervals. The arithmetic unit of the power supply mechanism may read the current time information written to the storage device of the power supply mechanism at predetermined intervals, execute a power stop or reset operation if the current time information has not been updated for a predetermined time or longer, and, after executing that operation, issue a recovery operation instruction to the other of the computers.

With this configuration, determining whether the current time information, that is, the time stamp, in the storage device of the power supply mechanism has been updated within a fixed time makes it possible to detect simply and reliably that the computer concerned is in some abnormal state in which it cannot execute the write processing.

In the multiplex system of the present embodiment, the arithmetic unit of each multiplexed computer may restart the function that executes the write processing in response to detecting the occurrence of a predetermined event.

With this configuration, when the function that performs the write processing (the function implemented by the survival notification program) itself malfunctions, that function can be restarted and write processing resumed quickly. This avoids system switching being triggered by a failure detection that stems solely from a malfunction of the write processing function when the computers making up the multiplex system are themselves sound.

The multiplex system of the present embodiment may further include a monitoring computer comprising a communication device that communicates with each of the multiplexed computers and an arithmetic unit that repeatedly transmits to each of the computers, at predetermined intervals, a request to write the predetermined information, conforming to the predetermined rule, to the storage device of the power supply mechanism. The arithmetic unit of each multiplexed computer may, each time it receives the write request from the monitoring computer, write the predetermined information indicated by the write request to the storage device of the power supply mechanism.

With this configuration, requests corresponding to the write processing are issued by a device entirely separate from the active and standby computers of the multiplex system, namely the monitoring computer, so the failure detection function is more easily maintained regardless of damage to or stoppage of a survival notification program on the active or standby system.

In the multiplex system of the present embodiment, the arithmetic unit of the monitoring computer may transmit to each computer, as the write request, a request to write current time information to the storage device of the power supply mechanism at predetermined intervals. The arithmetic unit of each multiplexed computer may, each time it receives the write request from the monitoring computer, write the current time information indicated by the write request to the storage device of the power supply mechanism.

With this configuration, determining whether the current time information, that is, the time stamp, in the storage device of the power supply mechanism has been updated within a fixed time makes it possible to detect simply and reliably that the computer concerned is in some abnormal state in which it cannot execute the write processing requested by the monitoring computer.

In the multiplex system of the present embodiment, the arithmetic unit of the monitoring computer may restart the function that issues the write request in response to detecting the occurrence of a predetermined event.

With this configuration, when the function that issues the write request (the function implemented by the survival notification program) itself malfunctions, that function can be restarted and write processing resumed quickly. This avoids system switching being triggered by a failure detection that stems from a malfunction in the monitoring computer when the computers making up the multiplex system are themselves sound.

10 Multiplex system
20 Network
30 Internal signal line
100 Server (computer)
101 Storage device
102 OS (Operating System)
103 Program
104 Memory
105 Arithmetic unit
106 Communication device
107 Drive
110 Business program
111 Cluster monitoring program
112 Survival notification program
113 Transfer program
150 Active server
180 Standby server
200 Power supply mechanism
201 Storage device
202 Program
203 Memory
204 Arithmetic unit
205 Communication device
210 Table monitoring program
211 Power supply control program
225 Monitoring table
230 Power supply device
240 Power supply control device
300 Monitoring computer
301 Storage device
302 OS (Operating System)
303 Program
304 Memory
305 Arithmetic unit
306 Communication device
310 Survival notification program

Claims

1. A multiplex system characterized in that the power supply mechanism of each multiplexed computer comprises an arithmetic unit that monitors write processing of predetermined information to a storage device of the power supply mechanism by another device or by another mechanism of the computer concerned, executes a power stop or reset operation when the write processing does not conform to a predetermined rule, and, after executing that operation, issues a recovery operation instruction to the other of the computers.

2. The multiplex system according to claim 1, wherein each of the multiplexed computers comprises an arithmetic unit that repeatedly executes the write processing on the storage device of the power supply mechanism at predetermined intervals.

3. The multiplex system according to claim 2, wherein:
the arithmetic unit of each multiplexed computer, as the write processing, writes current time information to the storage device of the power supply mechanism at predetermined intervals; and
the arithmetic unit of the power supply mechanism reads the current time information written to the storage device of the power supply mechanism at predetermined intervals, executes a power stop or reset operation when the current time information has not been updated for a predetermined time or longer, and, after executing that operation, issues a recovery operation instruction to the other of the computers.

4. The multiplex system according to claim 3, wherein the arithmetic unit of each multiplexed computer restarts the function that executes the write processing in response to detecting the occurrence of a predetermined event.

5. The multiplex system according to claim 1, further comprising a monitoring computer that comprises:
a communication device that communicates with each of the multiplexed computers; and
an arithmetic unit that repeatedly transmits to each of the computers, at predetermined intervals, a request to write the predetermined information, conforming to the predetermined rule, to the storage device of the power supply mechanism,
wherein the arithmetic unit of each multiplexed computer, each time it receives the write request from the monitoring computer, writes the predetermined information indicated by the write request to the storage device of the power supply mechanism.

6. The multiplex system according to claim 5, wherein:
the arithmetic unit of the monitoring computer transmits to each computer, as the write request, a request to write current time information to the storage device of the power supply mechanism at predetermined intervals; and
the arithmetic unit of each multiplexed computer, each time it receives the write request from the monitoring computer, writes the current time information indicated by the write request to the storage device of the power supply mechanism.

7. The multiplex system according to claim 6, wherein the arithmetic unit of the monitoring computer restarts the function that issues the write request in response to detecting the occurrence of a predetermined event.

8. A multiplex system management method characterized in that the power supply mechanism of each multiplexed computer monitors write processing of predetermined information to a storage device of the power supply mechanism by another device or by another mechanism of the computer concerned, executes a power stop or reset operation when the write processing does not conform to a predetermined rule, and, after executing that operation, issues a recovery operation instruction to the other of the computers.