JP6424134B2

JP6424134B2 - Computer system and computer system control method

Info

Publication number: JP6424134B2
Application number: JP2015088221A
Authority: JP
Inventors: 英宏河合; 晃尾田; 博史峯
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2015-04-23
Filing date: 2015-04-23
Publication date: 2018-11-14
Anticipated expiration: 2035-04-23
Also published as: JP2016206965A

Description

The present invention relates to a multiplex-system control method for a computer system.

A multiplexed computer system consists of an active computer that runs a service and a standby computer that monitors the active computer and takes over the service when it detects an abnormality. The series of steps by which the standby computer takes over the service when an abnormality occurs in the active computer is called failover (or system switchover).

Failover must satisfy three requirements. First, to support analysis of the cause of the abnormality, a memory dump of the failed active computer is collected. Second, so that the failed active computer does not output inappropriate data, the service on the active computer is completely stopped, or the active computer is confirmed to have stopped, before the standby computer takes over the service. Third, to shorten the service downtime during failover, the service on the active computer is stopped as early as possible.

When a recent operating system (hereinafter, OS) detects an abnormality itself or receives a non-maskable interrupt (NMI), it boots a second OS while preserving the memory contents of the first OS at that point, and the second OS persistently saves the memory contents of the first OS as a memory dump.

In a multiplexed system in particular, the standby computer that has detected an abnormality in the active computer inputs an NMI to the active computer by a known technique, which stops the service on the active computer and starts memory dump processing at the same time.

However, if the OS of the active computer has already started memory dump processing when the standby computer inputs an NMI according to this procedure, the memory dump processing may stop and the memory dump may fail.

As a method of suppressing NMI input after memory dump processing has started, Patent Document 1 discloses a technique in which "in the interrupt processing for the interrupt that a function expansion board mounted on the failed computer generates in response to an interrupt instruction message, failure information is saved, the function expansion board is instructed to suppress its interrupt generation function and computer operation stop function, and a subsequently transmitted message instructing the computer to stop is ignored so that saving of the failure information continues."

Patent Document 1: International Publication No. 99/26138

However, Patent Document 1 requires a function expansion board having the functions described above. Such special hardware is not in widespread general use, and it is difficult to obtain inexpensively in various regions. Moreover, a standard OS does not necessarily support a device driver for using that special hardware.

The present invention has been made in view of the above problems, and its object is to prevent a memory dump from being forcibly stopped when an abnormality occurs in the active computer, without requiring special hardware.

The present invention is a computer system comprising a plurality of computers each having a processor and a memory, at least one of which serves as an active computer and at least one of the others as a standby computer, and a first network that connects the active computer and the standby computer. The active computer has a computer management device that is connected to the active computer and manages it, a first OS that provides a service, a communication unit that transmits a heartbeat indicating that the first OS is running to the standby computer at a predetermined period, and a second OS that outputs the contents of the memory after a failure has occurred in the first OS. The computer management device has a watchdog timer that executes a predetermined action when it has counted to a second predetermined time, timer setting state information in which the operating state of the watchdog timer is set, and an adapter connected to the standby computer via a second network. At the start of the service, the active computer sets, in the timer setting state information of the computer management device, an enabled state indicating the start of the service, and sets a hard reset of the active computer as the action. When the first OS stops due to a failure, the processor of the active computer sends the computer management device a start command that starts the count of the watchdog timer, and then boots the second OS. The computer management device receives the start command and updates the timer setting state information to a started state. The standby computer acquires the timer setting state information after a first predetermined time has elapsed since it received the heartbeat and, if the timer setting state information is the started state, determines that the first OS has stopped and takes over the service of the active computer. The second OS of the active computer resets the watchdog timer every third predetermined time, and the computer management device, when it has counted to the second predetermined time, inputs the hard reset set as the predetermined action to the active computer.

According to the present invention, the memory dump by the second OS can be prevented from being forcibly stopped when an abnormality occurs in the active computer, without requiring special hardware. In addition, the standby computer can detect that the first OS of the active computer has stopped at the moment the active computer, stopped by the abnormality, transitions to the second OS. This allows the standby computer to take over the service shortly after the active computer stops due to the abnormality. Furthermore, because neither the active computer nor the standby computer requires OS modification or special drivers, a general-purpose OS can be used.

FIG. 1 is a block diagram showing an example of a multiplexed computer system according to a first embodiment of the present invention.
FIG. 2 is a time chart, according to the first embodiment, of the case where the active computer stops spontaneously.
FIG. 3 is a time chart, according to the first embodiment, of the case where the active computer stops in response to an instruction from the standby computer.
FIG. 4 is a time chart, according to a second embodiment, showing an example of processing that automatically restarts the first OS when the second OS for memory dump processing hangs.
FIG. 5 is a state transition diagram of the watchdog timer according to the first and second embodiments.

Embodiments of the present invention are described below with reference to the accompanying drawings.

FIG. 1 is a block diagram showing an example of a multiplexed computer system according to the first embodiment of the present invention. The multiplexed computer system consists of at least one active computer 10 among a plurality of computers, an active BMC (Baseboard Management Controller) 6 that manages the active computer 10, at least one standby computer 20, a standby BMC 7 that manages the standby computer 20, a network 32 that connects the active computer 10 and the standby computer 20, and a network 31 that connects the active computer 10 to the standby BMC 7 and the standby computer 20 to the active BMC 6.

The active computer 10 and the standby computer 20 have the same configuration. Each includes at least a processing unit 11 that executes software programs including an OS, as described later; a memory 12 that stores the programs and the data needed to execute them; a communication device 13 that performs heartbeat (hereinafter, HB) communication; a LAN adapter 14 for sending commands to the other system's BMC; and a storage device 15 that stores data and memory dump contents. The storage device 15 includes, for example, a hard disk or another nonvolatile storage medium. The communication device 13 is, for example, a LAN adapter.

The memory 12 of the active computer 10 stores a first OS 1 (1stOS in the figure) for providing a service (not shown), middleware 2 (Middle in the figure) that generates the HB and performs related processing, and a second OS 3 (2ndOS in the figure) that executes a memory dump after the first OS 1 has stopped. The processing unit 11 of the active computer 10 executes the first OS 1 and the middleware 2 until a failure occurs. Applications and services running on the first OS 1 are what the active computer 10 provides as the service.

The memory 12 of the standby computer 20 stores an OS 4 and middleware 5 (Middle in the figure) that monitors the HB of the active computer 10; both are executed by the processing unit 11.

The active BMC 6 and the standby BMC 7 have the same configuration. The active BMC 6 is mounted on the active computer 10, and the standby BMC 7 is mounted on the standby computer 20.

The active BMC 6 has a LAN adapter 16 for receiving commands from the standby computer 20 via the network 31, a processing unit 19, and a memory 8. The standby BMC 7 is configured in the same way, and its LAN adapter 16 receives commands from the active computer 10.

The memory 8 stores a watchdog timer (hereinafter, WDT) 18 and a WDT setting state 17 that holds the state of the WDT 18. The WDT setting state 17 is set to at least one of an "enabled state" and a "disabled state". Any other states or values may be used as long as they can be distinguished from each other.

In this embodiment, the "enabled state" indicates that the active computer 10 has "started" the service, and the "disabled state" indicates that the active computer 10 has "stopped" the service.

The active computer 10 and the active BMC 6 are connected by a system interface 30, for example a KCS (Keyboard Controller Style) interface, which communicates through simple I/O port operations. The same applies to the system interface 30 between the standby computer 20 and the standby BMC 7.

The active computer 10 and the standby computer 20, together with the active BMC 6 and the standby BMC 7, form a symmetric configuration, and they may swap roles and monitor and control each other.

Although the first embodiment shows an example in which the WDT 18 is implemented in the memory 8, the WDT may instead be implemented in hardware (not shown).

FIG. 2 is a time chart of the case where the active computer 10 stops spontaneously due to a failure. The example in FIG. 2 shows the processing flow of the active computer 10, the active BMC 6, and the standby computer 20 in the case where the active computer 10 itself detects an abnormality and performs OS stop processing (kernel panic) and boot processing of the second OS 3 for collecting a memory dump.

As described above, the active computer 10 and the standby computer 20 are connected by the network 32, the active computer 10 and the active BMC 6 are connected by the system interface 30, and the active BMC 6 and the standby computer 20 are connected by the network 31. The details of each step are as follows.

In step 100, before the active computer 10 starts the service, the middleware 2 sends the active BMC 6 a command that enables the watchdog timer (WDT) 18 as an initial setting. In the first embodiment, the WDT 18 is only placed in the enabled state; its timer is not started. The active computer 10 may perform this initial setting (initialization) at boot time, at the start of the service, or the like.

In step 101, the active BMC 6 sets the WDT setting state 17 to the "enabled state".

In steps 102 and 102n, the middleware 2 of the active computer 10 sends a heartbeat (HB) to the standby computer 20 via the network 32 at a predetermined time interval Ti, repeatedly notifying it that the active computer 10 is operating normally.

In steps 103 and 103n, the middleware 5 of the standby computer 20 receives the HB message from the active computer 10 and stores the reception time.

In step 104, the first OS 1 running on the active computer 10 detects an internal inconsistency and enters a kernel panic. The active computer 10 starts the transition to the second OS 3 for collecting the memory dump.

In step 105, the active computer 10 sends a command that disables the WDT 18 to the active BMC 6 via the system interface 30, as part of the OS-independent processing executed midway through the transition from the first OS 1 to the second OS 3.

Here, the OS-independent processing is processing that initializes standard devices such as VGA through simple I/O port operations and the like, in preparation for booting the second OS 3. Because communication over the KCS interface described above can also be performed through simple I/O port operations, it is easy to implement it so that it runs without an OS. For example, commands that operate the I/O ports may be loaded immediately before the second OS 3 stored in the memory 12.

In step 106, the active BMC 6 receives the disable command from the active computer 10 and updates the WDT setting state 17 to the "disabled state".

In step 107, when the standby computer 20 determines that a predetermined time Tc 202 has elapsed since it last received an HB, it sends a WDT state acquisition command to the active BMC 6 via the network 31. The active BMC 6 responds with the current WDT setting state 17, which is the "disabled state". The standby computer 20 thereby detects that the WDT setting state 17 of the active BMC 6 is the "disabled state".

In step 108, where Tf 203 is the required time from the abnormal stop of the active computer 10 to takeover of the service, the standby computer 20 evaluates the WDT setting state 17 acquired in step 107 when the time Tf 203 has elapsed since it last received an HB. If the result is the "disabled state", the standby computer 20 recognizes that the service on the active computer 10 has completely stopped and takes over the service that was running on the active computer 10. Known techniques can be applied to the takeover itself, so it is not described in detail here.

In step 109, the active computer 10 boots the second OS 3.

In step 110, when booting of the second OS 3 is complete, the active computer 10 starts memory dump processing. kdump or the like can be used for the memory dump processing, including the booting of the second OS 3.

In step 111, the active computer 10 saves the contents of the memory 12 used by the first OS 1 to, for example, the storage device 15, and completes the memory dump.

In step 112, after the memory dump is complete, the active computer 10 restarts the first OS 1. After restarting the first OS 1, the active computer 10 may function as a new standby system; alternatively, it may be stopped without being restarted.

The time Tc 202 from when the standby computer 20 last receives an HB until it acquires the WDT state (107) is determined as follows. Let the HB transmission interval be Ti 200 and the maximum time required from the kernel panic (104) to the WDT disablement (105) be Td 201 (for example, Td 201 is several hundred milliseconds). A time Tc 202 that satisfies Tc > Ti + Td and Tc < Tf is then set in the middleware 5 of the standby computer 20. The time Tf 203 from the last HB reception to the service takeover (108) is set in the middleware 5 by adding a predetermined value to the time Tc 202.
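The timing rule above can be sketched in a few lines of Python. This is only an illustration: the function names `timing_is_valid` and `standby_phase` and the phase labels are hypothetical and do not appear in the patent. The first function checks the two inequalities on Tc; the second maps the time elapsed since the last HB to the standby computer's next step.

```python
def timing_is_valid(ti, td, tc, tf):
    """Return True when Tc > Ti + Td and Tc < Tf, as the text requires.

    ti: HB transmission interval (Ti 200)
    td: maximum time from kernel panic to WDT disablement (Td 201)
    tc: delay after the last HB before the WDT state is read (Tc 202)
    tf: delay after the last HB before the takeover decision (Tf 203)
    """
    return ti + td < tc < tf


def standby_phase(elapsed, tc, tf):
    """Map the time since the last HB to the standby computer's next step.

    The returned labels are hypothetical names used only in this sketch.
    """
    if elapsed < tc:
        return "wait"        # the next HB may still arrive
    if elapsed < tf:
        return "query_wdt"   # step 107: read the WDT setting state
    return "decide"          # step 108: act on the state that was read
```

For example, with Ti = 1.0 s and Td = 0.5 s, a value of Tc = 2.0 s with Tf = 3.0 s satisfies both conditions, whereas Tc = 1.5 s does not because the first inequality is strict.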

With the above processing, the active computer 10 and the standby computer 20 can use standard computers equipped with BMCs, without the special hardware required in the conventional example, and still prevent the memory dump from being forcibly stopped when an abnormality occurs in the active computer 10.

In addition, the standby computer 20 can detect that the first OS 1 of the active computer has stopped at the moment the active computer 10, stopped by the abnormality, transitions to the second OS 3. This allows the standby computer 20 to take over the service shortly after the service stops due to the abnormality in the active computer 10. Furthermore, because neither the active computer 10 nor the standby computer 20 requires OS modification or special drivers, a general-purpose OS can be used.

FIG. 3 is a time chart of the case where, instead of the kernel panic of FIG. 2, the first OS 1 of the active computer 10 suffers an unresponsive failure such as a hang, for example an infinite loop in internal OS processing. The details of each step are as follows.

Steps 100 to 103n are the same as in the case of FIG. 2.

In step 120, the first OS 1 of the active computer 10 hangs, and the periodic HB transmission (102) stops.

In step 107, when a predetermined time Tc 202 has elapsed since it last received an HB, the standby computer 20 sends a WDT state acquisition command to the active BMC 6 via the network 31. The active BMC 6 responds with the WDT setting state 17. The standby computer 20 detects that the WDT setting state 17 of the active BMC 6 is still the "enabled state".

In step 121, where Tf 203 is the required time from the abnormal stop of the first OS 1 of the active computer 10 to takeover of the service, the standby computer 20 evaluates the WDT setting state 17 acquired in step 107 when the time Tf 203 has elapsed since it last received an HB. If the WDT setting state 17 is the "enabled state", the service on the active computer 10 may still be running, so the standby computer 20 sends an NMI input instruction to the active BMC 6 via the network 31, commanding the active computer 10 to stop the service and start memory dump processing.

In step 108, because the NMI input stops the first OS 1 of the active computer 10, the standby computer 20 takes over the service that was running on the active computer 10.

In step 122, the active BMC 6, having received the NMI input instruction, inputs an NMI to the active computer 10 via the system interface 30.

In step 104, the active computer 10, having received the NMI, performs kernel panic processing as its NMI handling.

In steps 105 and 106, as in FIG. 2, the active computer 10 sends the disable command for the WDT 18, and the active BMC 6 updates the WDT setting state 17 to the "disabled state".

In step 123, because the NMI input may fail to reliably stop the first OS 1 running on the active computer 10, the standby computer 20 may acquire the WDT state again after a predetermined time (Tc − Ti) 210 has elapsed since the NMI input instruction (121), and confirm that the newly acquired WDT setting state 17 is the "disabled state". In that case, the service takeover is not performed in step 108 but after step 123 has confirmed that the WDT setting state 17 is the "disabled state".

Steps 109 to 112 are the same as in FIG. 2: the active computer 10 boots the second OS 3 and executes the memory dump, and restarts the first OS 1 after the memory dump is complete.
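The standby-side decisions in FIG. 2 and FIG. 3 can be summarized in a short Python sketch. Everything here is illustrative: `FakeBmc`, `standby_failover`, and the log strings are hypothetical names, the waits for Tc, Tf, and (Tc − Ti) are omitted, and a real system would talk to the active BMC 6 over the network 31 rather than to an in-process object. The `takeover_withheld` outcome is also an assumption, since the text does not say what happens if the re-check in step 123 still reads the enabled state.

```python
ENABLED, DISABLED = "enabled", "disabled"


class FakeBmc:
    """In-process stand-in for the active BMC 6 (hypothetical, for illustration)."""

    def __init__(self, state=ENABLED, nmi_stops_os=True):
        self.state = state
        self.nmi_stops_os = nmi_stops_os
        self.nmi_count = 0

    def get_wdt_state(self):
        # Reply to the WDT state acquisition command.
        return self.state

    def inject_nmi(self):
        # Steps 122 and 104 to 106: the NMI causes a kernel panic, and the
        # panic path disables the WDT before the second OS boots.
        self.nmi_count += 1
        if self.nmi_stops_os:
            self.state = DISABLED


def standby_failover(bmc, recheck=True):
    """Run the standby-side decision after the HB has gone silent."""
    log = []
    state = bmc.get_wdt_state()       # step 107, at Tc after the last HB
    log.append(f"wdt_state={state}")
    if state == DISABLED:             # FIG. 2: the first OS already panicked
        log.append("takeover")        # step 108
        return log
    bmc.inject_nmi()                  # step 121, FIG. 3: the first OS hung
    log.append("nmi")
    if recheck and bmc.get_wdt_state() != DISABLED:   # optional step 123
        log.append("takeover_withheld")               # assumption, see above
        return log
    log.append("takeover")
    return log
```

Calling `standby_failover(FakeBmc(state=DISABLED))` follows the FIG. 2 path and never issues an NMI, which is what protects a memory dump that is already in progress.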

With the above processing, in the first embodiment, in a multiplexed computer system built from BMC-equipped computers and a general-purpose OS, the standby computer 20 can quickly detect that the first OS of the active computer 10 has stopped abnormally due to a hang or the like, without stopping the memory dump processing performed on the active computer 10.

Furthermore, the failed active computer 10 disables the WDT 18 before booting the second OS 3 for collecting the memory dump, and can thereby notify the standby computer 20 that the service on the active system has stopped. This shortens the time from the occurrence of a failure in the active computer 10 until the standby computer 20 takes over the service.

When neither a function expansion board as in the conventional example nor a BMC is used, the standby computer 20 cannot be notified that the service on the active system has stopped until the second OS 3 for collecting the memory dump has booted and initialization of the LAN adapter 14 and the like is complete. As a result, the wait from the occurrence of a failure in the active computer 10 until the standby computer 20 takes over its service becomes long.

In contrast, by disabling the WDT 18 before booting the second OS 3 as in the first embodiment, the standby computer 20 can acquire the WDT setting state 17 once the time Tc has elapsed since the last HB was received and determine that the service on the active system has stopped. This enables rapid failover.

Although the first embodiment shows an example in which the middleware 2 of the active computer 10 sends the HB and the middleware 5 of the standby computer 20 monitors the active computer 10 for failures, these processes may be performed by the OS.

Also, although the first embodiment shows an example that provides the WDT setting state 17, which stores the state of the WDT 18 mounted on the active BMC 6, the invention is not limited to this. It suffices that the "enabled state" and "disabled state" can be set by the active BMC 6 or the like without affecting the execution of the active computer 10 (that is, without unintended side effects). In other words, the two states may be represented by any resource, such as an area reserved in the memory 8 of the active BMC 6 or a register of the processing unit 11, that can be used without affecting the active computer 10.

本発明の第２の実施例は、前記実施例１の図２、３に示したメモリダンプ採取用の第２のＯＳ（ＯｐｅｒａｔｉｎｇＳｙｓｔｅｍ）３の起動を開始した後に、第２のＯＳ３がハングアップした場合に、現用系計算機１０にハードリセットを入力し、第１のＯＳ１を再起動させるように変更したものである。これにより、異常発生によって停止した現用系計算機１０で、第２のＯＳ３に障害が発生しても新たな待機系として早期に復帰させることを保証できる。 In the second embodiment of the present invention, after starting the second OS (Operating System) 3 for collecting a memory dump shown in FIGS. 2 and 3 of the first embodiment, the second OS 3 hangs up. In this case, a hard reset is input to the active computer 10 and the first OS 1 is restarted. As a result, it is possible to ensure that the active computer 10 stopped due to the occurrence of abnormality can be quickly restored as a new standby system even if a failure occurs in the second OS 3.

The configuration of the computer system of the second embodiment is the same as that of the first embodiment; the difference is that it uses the function of the WDT 18, which the first embodiment did not use.

FIG. 5 shows, in simplified form, the state transitions of the WDT setting state 17 when the WDT 18 of the active BMC 6 is used in the second embodiment; more precisely, the behavior conforms to the IPMI (Intelligent Platform Management Interface) specification.

First, when the WDT setting state 17 is in any state, the arithmetic unit 19 performs the initial setting 500, and the state transitions to the valid state 501. At this time, the arithmetic unit 19 sets the initial counter, which is one of the setting items of the WDT setting state 17, to Tt and sets the action to hard reset.

Next, when the WDT start 502 command (in practice the same as the WDT reset command) is transmitted to the active BMC 6, the WDT setting state 17 enters the start state 503, and the arithmetic unit 19 sets the current counter, which is part of the WDT setting state 17, to the initial counter value Tt.

While the WDT setting state 17 is in the start state 503, the arithmetic unit 19 decrements the current counter by 1 each time a preset unit time elapses (504).

Likewise, while the WDT setting state 17 is in the start state 503, when the WDT reset 505 command is transmitted to the active BMC 6, the arithmetic unit 19 sets the current counter back to the initial counter value Tt.

Furthermore, while the WDT setting state 17 is in the start state 503, when the current counter reaches 0, a hard reset is input, in accordance with the action setting, to the computer on which the active BMC 6 is mounted, and the state transitions to the timer firing state 507. Thus, when the WDT 18 expires (counts up), the hard reset set as the predetermined action is input to the active computer 10, which is forcibly restarted.
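The state transitions of FIG. 5 described above can be modeled with a small state machine. The sketch below is illustrative only: the class and method names are hypothetical, the `on_fire` callback stands in for the hard reset input, and the choice to ignore start/reset commands once the timer has fired is an assumption that the text does not specify.

```python
from enum import Enum


class State(Enum):
    VALID = 501    # valid state
    STARTED = 503  # start state
    FIRED = 507    # timer firing state


class Watchdog:
    def __init__(self, on_fire):
        self.state = None       # "any state" before the initial setting
        self.initial = 0        # initial counter (Tt)
        self.current = 0        # current counter
        self.on_fire = on_fire  # action, e.g. hard reset of the active computer

    def initialize(self, tt):
        """Initial setting 500: allowed from any state."""
        self.initial = tt
        self.state = State.VALID

    def start_or_reset(self):
        """WDT start 502 and WDT reset 505 are the same command."""
        if self.state in (State.VALID, State.STARTED):  # assumption: ignored otherwise
            self.current = self.initial
            self.state = State.STARTED

    def tick(self):
        """Unit time elapsed 504: decrement, and fire when the counter reaches 0."""
        if self.state is not State.STARTED:
            return
        self.current -= 1
        if self.current <= 0:
            self.state = State.FIRED
            self.on_fire()
```

For example, with `tt = 3`, a reset issued after two ticks restores the counter to 3, so the action runs only after three further ticks without a reset.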

Based on the above, the processing that realizes the second embodiment is described with reference to FIGS. 4 and 5. FIG. 4 is a time chart showing an example of processing that automatically restarts the first OS 1 when the second OS 3 for memory dump processing hangs up. The processing of FIG. 4 applies the second embodiment to the case of FIG. 2 of the first embodiment, but it can be applied in the same way to the case of FIG. 3 of the first embodiment. The details of each step are as follows.

In step 130, before starting its work, the active computer 10 transmits to the active BMC 6, as an initial setting, a command that puts the WDT 18 into the valid state 501. This only puts the WDT 18 into the valid state 501; the timer (current counter) is not started.

In step 131, the active BMC 6 receives the command to enable the WDT 18 and sets the WDT setting state 17 to the valid state 501.

Steps 102 to 104 are the same as in FIG. 2 of the first embodiment: after an HB has been transmitted, an abnormality occurs and the first OS 1 of the active computer 10 enters a kernel panic.

In step 132, during the OS-independent processing performed before the second OS 3 for collecting the memory dump is started, the WDT start 502 command is transmitted to the active BMC 6 via the system interface 30. As in the first embodiment, the second embodiment instructs the active BMC 6 to change its setting before starting the second OS 3.

In step 133, the active BMC 6 updates the WDT setting state 17 to the start state 503.

Steps 107 to 109 are the same as in FIG. 2 of the first embodiment: the standby computer 20, having acquired the updated WDT setting state 17, takes over the work of the active computer 10, and the active computer 10 starts the second OS 3.

In steps 134 to 134n, after the second OS 3 has started and while the memory dump is in progress, the active computer 10 transmits the WDT reset 505 command to the active BMC 6 via the system interface 30 at a preset fixed period.

In steps 135 to 135n, the active BMC 6 receives the WDT reset 505 command and resets the current counter to the initial counter value Tt. The active BMC 6 then repeats the counting of the current counter as described above.

In step 110, when the startup processing of the second OS 3 is complete, the active computer 10 starts the memory dump processing.

In step 136, the memory dump processing of the second OS 3 hangs up for some reason. In step 137, once the predetermined time Tt has elapsed since the last WDT reset 135n, the current counter reaches 0 and the timer fires. The active BMC 6 sets the WDT setting state 17 to the timer firing state 507 and, as set in the action, inputs a hard reset to the active computer 10.

In step 138, the hard reset input in step 137 forcibly restarts the active computer 10. The active computer 10 can thus return as a new standby computer.

The same applies when the hang-up (136) of the second OS 3 occurs at step 109 or at any later point.
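The periodic WDT reset of steps 134 to 134n can be sketched as a dump loop that keeps the watchdog from firing only while it is making progress. This is one possible arrangement, not necessarily the one used in the embodiment: `dump_memory`, `write_chunk`, `send_wdt_reset`, and `clock` are hypothetical names, and interleaving the resets with chunk writes is an assumption. The sketch also assumes `reset_period` is shorter than Tt and that no single chunk write approaches Tt.

```python
def dump_memory(chunks, write_chunk, send_wdt_reset, clock, reset_period):
    """Write each chunk, sending a WDT reset whenever reset_period has elapsed.

    Returns the number of WDT reset commands sent.
    """
    send_wdt_reset()
    last_reset = clock()
    sent = 1
    for chunk in chunks:
        write_chunk(chunk)  # if this call never returns, no further resets are sent
        if clock() - last_reset >= reset_period:
            send_wdt_reset()
            last_reset = clock()
            sent += 1
    return sent
```

Because the reset is issued from the same control flow as the dump, a hang in the dump stops the resets, and the watchdog on the management device side is left to expire.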

With the above processing, the second embodiment additionally provides the following: in the memory dump processing performed by the second OS 3 on the active computer 10 after the first OS 1 has stopped, even if the second OS 3 hangs up at some point between its startup and the completion of the memory dump, the work is handed over to the standby computer 20, and the active computer 10 can be restarted within a predictable time and returned as a new standby system.
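The "predictable time" can be illustrated with a small calculation. The sketch below rests on simplifying assumptions that are not in the text: the WDT start command is sent at `wdt_start`, resets then follow exactly every `reset_period` until the hang, a reset due exactly at the moment of the hang is still sent, and all times are integers in the same unit. The function name and parameters are hypothetical.

```python
def hard_reset_time(wdt_start, reset_period, hang_time, tt):
    """Return the time at which the WDT fires after a hang at hang_time."""
    if hang_time < wdt_start:
        raise ValueError("hang cannot precede the WDT start command")
    if reset_period >= tt:
        raise ValueError("reset_period must be shorter than tt")
    # Number of periodic resets sent between the start command and the hang.
    resets_sent = (hang_time - wdt_start) // reset_period
    last_reset = wdt_start + resets_sent * reset_period
    # The counter was last reloaded with Tt at last_reset, so it reaches 0 here.
    return last_reset + tt
```

Under these assumptions the hard reset arrives no later than Tt after the hang, since the last reset cannot be later than the hang itself. With `wdt_start = 0`, `reset_period = 10`, `tt = 30`, and a hang at time 25, the last reset is at 20 and the hard reset is input at 50.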

The first and second embodiments show an example in which the active BMC 6 is used as the device that inputs an NMI or a hard reset to the active computer 10, but the invention is not limited to this. Any computer management device capable of inputting an NMI or a hard reset to the active computer 10 may be used.

Also, in the first and second embodiments, the standby computer 20 refers to the WDT setting state 17 of the active BMC 6 to determine that the work on the active computer 10 has reliably stopped, but the invention is not limited to the WDT setting state 17. For example, when the active computer 10 starts its work (or before it starts), the middleware 2 or the first OS 1 sets “start” (= WDT valid) in a work operation state, held by the active BMC 6 or the computer management device, that indicates the start of the work.

Then, when the first OS 1 enters a kernel panic, a predetermined I/O operation sets “stop” (= WDT invalid), indicating that the work has stopped, in the work operation state of the active BMC 6 or the active system's computer management device before the second OS 3 is started. After the predetermined time Tc has elapsed since the last HB, the standby computer 20 acquires the work operation state from the active BMC 6 or the computer management device and, if it is “stop”, can determine that the work of the active computer 10 has reliably stopped and take over the work.

Thus, in the present invention, “start” or “stop” is set as the work operation state in the computer management device connected to the active computer 10, and if the work operation state is “stop” when the standby computer 20 detects the occurrence of a failure, the standby computer 20 takes over the work of the active computer 10, achieving quick failover.
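The generalized scheme, in which a work operation state on the computer management device takes the place of the WDT setting state 17, can be sketched from both sides as follows. All names here (`ManagementDevice`, `on_work_start`, `on_kernel_panic`, `standby_check`, `boot_second_os`) are hypothetical, and the in-memory object merely stands in for a real management device.

```python
class ManagementDevice:
    """Hypothetical stand-in for the BMC or computer management device."""

    def __init__(self):
        self.work_state = None  # "start" or "stop"; None until first set

    def set_work_state(self, value):
        if value not in ("start", "stop"):
            raise ValueError(value)
        self.work_state = value


def on_work_start(device):
    """Active side: called when (or before) the work starts."""
    device.set_work_state("start")


def on_kernel_panic(device, boot_second_os):
    """Active side: record the stop first, then start the second OS."""
    device.set_work_state("stop")
    boot_second_os()


def standby_check(device, elapsed_since_last_hb, tc):
    """Standby side: take over only if the HB is overdue and the state is "stop"."""
    return elapsed_since_last_hb >= tc and device.work_state == "stop"
```

The ordering in `on_kernel_panic` is the essential point of the sketch: the state is written before the second OS starts, so the standby side never has to wait for the memory dump to finish.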

Also, the first and second embodiments show an example that uses the HB, which the communication device 13 transmits to the standby computer 20 at a predetermined period, and the state that the active computer 10 sets in the WDT setting state 17, but the invention is not limited to this. For example, first operation information indicating that the active computer 10 is operating and second operation information indicating that the first OS 1 of the active computer 10 has stopped may be provided in the computer management device. In this case, the standby computer 20 acquires the second operation information from the computer management device a predetermined time after it last received the first operation information, and can thereby determine that, after a failure occurred in the first OS 1, the first OS 1 has reliably stopped.

The present invention is not limited to the embodiments described above and includes various modifications. For example, the embodiments are described in detail to explain the invention clearly, and the invention is not necessarily limited to configurations having all of the elements described. Part of the configuration of one embodiment can be replaced with the configuration of another embodiment, and the configuration of another embodiment can be added to the configuration of one embodiment. Furthermore, for part of the configuration of each embodiment, addition, deletion, or replacement of other configurations can be applied either alone or in combination.

Each of the above-described configurations, functions, processing units, processing means, and the like may be realized partly or wholly in hardware, for example by designing them as an integrated circuit. Each of the above-described configurations, functions, and the like may also be realized in software, with a processor interpreting and executing a program that realizes each function. Information such as the programs, tables, and files that realize each function can be stored in a memory, in a recording device such as a hard disk or SSD (Solid State Drive), or on a recording medium such as an IC card, SD card, or DVD.

The control lines and information lines shown are those considered necessary for the explanation; not all control lines and information lines in a product are necessarily shown. In practice, almost all components may be considered to be interconnected.

Description of Symbols
6 Active BMC
7 Standby BMC
10 Active computer
11 Arithmetic unit
12 Memory
13 Communication device
14 LAN adapter
15 Storage device
17 WDT setting state
18 Watchdog timer
20 Standby computer
30 System interface
31, 32 Network

Claims

A computer system comprising a plurality of computers each having a processor and a memory, at least one of the plurality of computers being an active computer and at least one of the other computers being a standby computer, and a first network connecting the active computer and the standby computer, wherein
the active computer includes:
a computer management device that is connected to the active computer and manages the active computer;
a first OS that provides work;
a communication unit that transmits, to the standby computer at a predetermined period, a heartbeat indicating that the first OS is in operation; and
a second OS that outputs the contents of the memory after a failure has occurred in the first OS,
the computer management device includes:
a watchdog timer that executes a predetermined action when it counts up to a second predetermined time;
timer setting state information in which the operating state of the watchdog timer is set; and
an adapter connected to the standby computer via a second network,
the active computer, at the start of the work, sets a valid state indicating the start of the work in the timer setting state information of the computer management device and sets a hard reset of the active computer as the action,
the processor of the active computer, when the first OS stops due to a failure, transmits to the computer management device a start command for starting the count of the watchdog timer, and then starts the second OS,
the computer management device receives the start command and updates the timer setting state information to a start state,
the standby computer acquires the timer setting state information after a first predetermined time has elapsed since the heartbeat was received and, if the timer setting state information is the start state, determines that the first OS has stopped and takes over the work of the active computer,
the second OS of the active computer resets the watchdog timer every third predetermined time, and
the computer management device inputs the hard reset set as the predetermined action to the active computer when the watchdog timer counts up to the second predetermined time.

  The computer system according to claim 1, wherein
  the second OS performs a dump of the memory of the active computer in which the failure has occurred.

  A control method for a computer system that has a plurality of computers each having a processor and a memory, at least one of the plurality of computers being an active computer and at least one of the other computers being a standby computer, the control method switching the work provided by the active computer to the standby computer, wherein
  the active computer includes:
  a computer management device that is connected to the active computer, manages the active computer, and has a watchdog timer that executes a predetermined action when it counts up to a second predetermined time, timer setting state information in which an operating state of the watchdog timer is set, and an adapter connected to the standby computer via a second network;
  a first OS that provides the work;
  a communication unit that transmits, to the standby computer at a predetermined period, a heartbeat indicating that the first OS is in operation; and
  a second OS that outputs the contents of the memory after a failure has occurred in the first OS,
  and the control method comprises:
  a first step in which the active computer operates the first OS that provides the work;
  a second step in which the active computer sets a valid state indicating the start of the work in the timer setting state information of the computer management device that manages the active computer, and sets a hard reset of the active computer as the action;
  a third step in which the active computer transmits the heartbeat indicating that the first OS is in operation to the standby computer via a network at a predetermined period;
  a fourth step in which the processor of the active computer, when the first OS stops due to a failure, transmits to the computer management device a start command for starting the count of the watchdog timer;
  a fifth step in which the computer management device receives the start command and updates the timer setting state information to a start state;
  a sixth step in which the active computer starts the second OS;
  a seventh step in which the standby computer acquires the timer setting state information after a first predetermined time has elapsed since the heartbeat was received, and determines whether the timer setting state information is the start state;
  an eighth step in which the standby computer, if the result of the determination is the start state, determines that the first OS has stopped and takes over the work of the active computer;
  a ninth step in which the second OS of the active computer resets the watchdog timer every third predetermined time; and
  a tenth step in which the hard reset set as the predetermined action is input to the active computer when the watchdog timer of the computer management device counts up to the second predetermined time.

  The control method for a computer system according to claim 3, wherein
  the sixth step includes dumping the memory of the active computer in which the failure has occurred.