JP2008305353A

JP2008305353A - Cluster system and fail-over method

Info

Publication number: JP2008305353A
Application number: JP2007154481A
Authority: JP
Inventors: Hideji Nishijima; 英児西島; Koji Fujimoto; 幸治藤本
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2007-06-11
Filing date: 2007-06-11
Publication date: 2008-12-18

Abstract

PROBLEM TO BE SOLVED: To control the execution of fail-over in accordance with a processing state of a program.

SOLUTION: A server computer 102 constituting a cluster system 101 is provided with a service providing part 112 for providing a service to client computers 104 to 106, a cluster processing part 111 for executing fail-over when detecting operation stop of the other server computer 102 or when receiving an instruction to execute fail-over, and a service providing part state monitoring part 121 for monitoring a processing state of the service providing part 112. The service providing part state monitoring part 121 periodically determines whether a processing time of the service providing part 112 exceeds a predetermined critical processing time, and instructs the cluster processing part 111 to execute fail-over when it is determined that the processing time exceeds the predetermined critical processing time.

COPYRIGHT: (C)2009,JPO&INPIT

Description

The present invention relates to a technique for controlling the execution of failover.

Cluster systems, in which a plurality of computers are loosely coupled, are used to improve the reliability and processing performance of computer systems. Known types include the load distribution type, which reduces the processing load per server computer, and the failover type, in which, when a failure occurs on the server computer running a business application (the active server), resources such as the business application on that server computer are taken over by another server computer (the standby server) so that operations continue. For example, Patent Document 1 describes a method of executing failover in a cluster system by changing the execution priority of resources according to the execution priorities of the resources being taken over and of the resources already running at the failover destination.

Patent Document 1: JP-A-11-353292

In a conventional failover-type cluster system such as that of Patent Document 1, a server computer, for example, confirms through heartbeat communication that another server computer being monitored is operating normally, and executes failover when it detects that the monitored server computer has stopped, that is, gone down because of a hardware failure or the like.

In such a cluster system, however, the processing status of the programs running on the server computer is not monitored. In other words, failover is not executed when program responsiveness degrades because of a shortage of resources such as CPU or memory, a software failure, or the like. As a result, the cluster system cannot immediately restore the responsiveness of the service provided to clients to a normal state.

An object of the present invention is therefore to provide a technique that, in a cluster system, controls the execution of failover according to the processing status of a program as well as hardware failures.

To solve the above problem, a first aspect provides a server constituting a cluster system that has a plurality of servers, each including a failover processing unit that performs failover processing upon receiving a failover execution instruction. The server has: a storage unit that stores state information specifying a processing status in which failover is to be performed; a status monitoring unit that identifies the processing status while a service is being provided; and a failover execution instructing unit that instructs the failover processing unit to execute failover when the processing status identified by the status monitoring unit matches the processing status specified by the state information.

The storage unit also stores, in association with each piece of state information, condition information specifying a condition for performing failover. When the processing status identified by the status monitoring unit matches the processing status specified by a piece of state information, and that processing status also satisfies the condition information associated with the matching state information, the failover execution instructing unit instructs the failover processing unit to execute failover.

Embodiments of the present invention will be described with reference to the drawings.

The first embodiment is described below, taking as an example a configuration in which the present invention is applied to a network boot system. First, the network boot system is outlined.

In a network boot system, a boot image consisting of the files that a client computer needs in order to start, such as the OS and application programs (hereinafter, applications), is stored on a server computer. At startup, the client computer reads the boot image from the server computer over the network and boots from it.

In a network boot system composed of a single server computer, the network boot service can no longer be provided if that server computer goes down because of a hardware or software failure. That is, client computers cannot be booted over the network, and users therefore cannot use them. In addition, if the responsiveness of the server program that provides the network boot service on that server computer degrades, a client computer's network boot takes longer and may fail because of a timeout or similar cause. As a result, the user cannot start using the client computer immediately.

One known network boot method is PXE (Preboot Execution Environment) boot. In PXE boot, a boot image is configured for each client computer on the server computer, and the boot images are stored there. When a client computer boots, it sends the MAC address of its own network card to the server computer. The server computer then sends the boot image corresponding to the received MAC address to the client computer.
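The MAC-address lookup on the server side can be pictured as a simple table. The following is a minimal sketch and not part of the disclosure; the table contents, the image paths, and the function name `select_boot_image` are hypothetical.

```python
from typing import Optional

# Hypothetical per-client configuration: MAC address -> boot image path.
BOOT_IMAGES = {
    "00:11:22:33:44:55": "/images/client1.img",
    "00:11:22:33:44:66": "/images/client2.img",
}


def select_boot_image(mac_address: str) -> Optional[str]:
    """Return the boot image registered for the given MAC address, if any."""
    # Normalize case and separator so "00-11-..." and "00:11:..." both match.
    key = mac_address.strip().lower().replace("-", ":")
    return BOOT_IMAGES.get(key)
```

A client whose MAC address is not registered gets `None`, which a real server would presumably treat as "no boot image configured".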

FIG. 1 is a block diagram showing the system configuration to which the first embodiment is applied and the functional configuration of the server computer. Because the server computer 102b has the same functional configuration as the server computer 102a, its functional configuration is not shown. When the server computer 102a and the server computer 102b need not be distinguished, they are simply referred to as the server computer 102.

As shown in the figure, the cluster system 101 and the client computers 104, 105, and 106 are connected via a LAN (Local Area Network) 107. The connection method is of course not limited to a LAN, and other methods may be used.

The cluster system 101 consists of the server computer 102a and the server computer 102b and is thus made redundant. That is, one of the server computers 102a and 102b operates as the active system and the other operates as the standby system. The server computers 102a and 102b are connected by a LAN 108 and monitor each other's operating status through heartbeat communication over the LAN 108. The connection between the server computers 102a and 102b is of course not limited to a LAN and may be, for example, serial communication.

A control unit 150 and a storage unit 152 are built on the server computer 102. The control unit 150 includes a cluster processing unit 111, a service providing unit 112, and a service providing unit status monitoring unit 121. The storage unit 152 includes a service providing unit operation history table 113, a service providing unit characteristic table 122, a service providing unit state table 123, and a failover instruction rule table 124.

The server computer 102 with the above functions can be realized by a general computer such as the one shown in FIG. 7, which includes a CPU 11, a main storage device 12, an auxiliary storage device 13 such as a hard disk, a reading device 16 that reads information from a portable storage medium such as a CD-ROM, an input device 14 such as a keyboard or mouse, an output device 15 such as a display, and a communication device 17 for communicating with external devices.

For example, the control unit 150 and the storage unit 152 are realized by the CPU 11 loading a predetermined program stored in advance in the auxiliary storage device 13 into the main storage device 12 and executing it. The storage unit 152 uses the main storage device 12 to hold data temporarily and the auxiliary storage device 13 to hold data persistently. The program may be downloaded to the auxiliary storage device 13 from a portable storage medium via the reading device 16, or from a network via the communication device 17, and then loaded onto the main storage device 12 and executed by the CPU 11. Alternatively, the program may be loaded directly onto the main storage device 12 from a portable storage medium via the reading device 16, or from a network via the communication device 17, and executed by the CPU 11.

Returning to FIG. 1, a cluster processing unit 111 runs on each of the active and standby server computers 102, and the two perform heartbeat communication with each other via the LAN 108. The standby cluster processing unit 111 executes failover when it detects that the survival signal from the active cluster processing unit 111 has stopped. In addition, the active cluster processing unit 111, on receiving an instruction from the active service providing unit status monitoring unit 121, causes the standby cluster processing unit 111 to execute failover.

The operation of the cluster processing unit 111 is described concretely below, taking as an example the case where the server computer 102a is the active system and the server computer 102b is the standby system.

In this case, under the control of the cluster processing unit 111, the service providing unit 112 on the server computer 102a is running, while the service providing unit 112 on the server computer 102b is stopped. The service providing unit 112 is described later. The cluster processing unit 111 on the server computer 102a and the cluster processing unit 111 on the server computer 102b perform heartbeat communication via the LAN 108 and monitor the operating status of each other's server computer.

If, for example, the server computer 102a stops because of a hardware failure, the cluster processing unit 111 on the server computer 102b detects that the survival signal from the server computer 102a has stopped. It then regards the server computer 102a as down and executes the takeover of resources (failover). The resources taken over include the data that the service providing unit 112 of the server computer 102a uses in its processing, the alias IP address on the LAN 107 side, and the processing of the service providing unit 112.
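Detecting that the survival signal has stopped amounts to a timeout check on the last heartbeat received. The sketch below illustrates that idea only; the class name `HeartbeatMonitor` and the timeout value are assumptions, since the text does not state how long a silence counts as "stopped".

```python
class HeartbeatMonitor:
    """Tracks the last survival signal received from the peer server."""

    def __init__(self, timeout_seconds: float, now: float) -> None:
        # timeout_seconds is an assumed tuning parameter, not from the text.
        self.timeout_seconds = timeout_seconds
        self.last_seen = now

    def record_heartbeat(self, now: float) -> None:
        """Call whenever a survival signal arrives from the peer."""
        self.last_seen = now

    def peer_is_down(self, now: float) -> bool:
        """True when no heartbeat has arrived within the timeout window."""
        return (now - self.last_seen) > self.timeout_seconds
```

In this sketch the standby side would poll `peer_is_down` periodically and start the resource takeover once it returns `True`.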

The IP address is taken over when the cluster processing unit 111 on the server computer 102b adds, to the settings of the LAN 107 side network interface of the server computer 102b, the same alias IP address as that of the LAN 107 side network interface of the server computer 102a. An alias IP address is an additional IP address that serves as another name for an interface's IP address, and it is easy to add or change. The processing of the service providing unit 112 is taken over when the cluster processing unit 111 on the server computer 102b starts the service providing unit 112 on the server computer 102b. Data is taken over, for example, via a shared storage device to which both server computers 102 are connected.

After the resources have been taken over from the server computer 102a by the server computer 102b, the server computer 102b switches from standby to active. The server computer 102a remains stopped, but when it recovers from the stop it switches from active to standby.

As described above, when the active server computer 102a stops, the server computer 102b takes over the resources and switches from standby to active. The newly started service providing unit 112 then begins providing services to the client computers 104 to 106 via the LAN 107, using the data that was taken over. Because the client computers 104 to 106 can access the server computer 102b using the same IP address, they continue to receive services from the service providing unit 112 as usual, regardless of the failover.

The service providing unit 112 is, for example, a server program, an application program, or a business processing program, and there may be more than one. In this embodiment, the service providing unit 112 consists of a DHCP (Dynamic Host Configuration Protocol) server and a TFTP (Trivial File Transfer Protocol) server, which are the server programs needed for network booting. The DHCP server is a server program that automatically configures network information, such as the IP address, for a client computer. The TFTP server is a server program that transfers files to a client computer without authentication such as a user name. The file transferred is the boot image, consisting of the OS and application files that the client computer needs at startup.

The service providing unit 112 on the standby server computer 102 does not run, because the cluster processing unit 111 does not start it. When that server computer 102 switches to active, the cluster processing unit 111 starts the service providing unit 112, which then begins operating.

The active service providing unit 112, on the other hand, provides services to the client computers 104 to 106 and stores its operation history in real time in the service providing unit operation history table 113 shown in FIG. 2.

FIG. 2 shows the structure of the service providing unit operation history table 113. The table stores operation history information in which a date 201, a time 202, a host name 203, a service providing unit name 204, and a content 205 are associated with one another. The date 201 stores the date on which the service providing unit 112 operated. The time 202 stores the time at which the service providing unit 112 operated. The host name 203 stores the host name identifying the client computer that accessed the service providing unit 112. The service providing unit name 204 stores the name identifying the service providing unit 112. The content 205 stores the details of the operation of the service providing unit 112.

An example of the operation history stored by the service providing unit 112 (DHCP server and TFTP server) is described with reference to FIG. 2. In the following, "client1", "client2", and "client3" are the host names of the client computers 104 to 106. "dhcpd" is the program name of the DHCP server, and "tftpd" is the program name of the TFTP server.

The operation history 210 in the first row through the operation history 240 in the fourth row show the operation history of the service providing unit 112 when the client computer 104 "client1" performs a network boot. The network boot of the client computer 104 "client1" is carried out by the DHCP server "dhcpd" providing an IP address to the client computer 104 "client1", after which the TFTP server "tftpd" transfers the boot image to the client computer 104 "client1".

The operation history 210 shows that, at time "12:00:01" on date "2007 01 06", the DHCP server "dhcpd" accepted an access from the client computer 104 "client1" and started ("start") providing the service. The operation history 220 shows that, at time "12:00:03" on date "2007 01 06", the DHCP server "dhcpd" finished ("end") providing the service to the client computer 104 "client1". Here "end" means that the DHCP server "dhcpd" has returned to waiting for service requests, not that it has stopped operating. The operation history 230 shows that, at time "12:00:04" on date "2007 01 06", the TFTP server "tftpd" accepted an access from the client computer 104 "client1" and started ("start") providing the service. The operation history 240 shows that, at time "12:00:24" on date "2007 01 06", the TFTP server "tftpd" finished ("end") providing the service to the client computer 104 "client1". Again, "end" means that the TFTP server "tftpd" has returned to waiting for service requests, not that it has stopped operating.
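The four rows described above can be written down as records with the five columns of FIG. 2. This is a minimal sketch for illustration; the class name `OperationRecord`, the field names, and the `timestamp` helper are assumptions, and only the values come from the text.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class OperationRecord:
    date: str      # date 201, e.g. "2007 01 06"
    time: str      # time 202, e.g. "12:00:01"
    host: str      # host name 203
    service: str   # service providing unit name 204
    content: str   # content 205: "start" or "end"

    def timestamp(self) -> datetime:
        """Combine date 201 and time 202 into one datetime value."""
        return datetime.strptime(f"{self.date} {self.time}", "%Y %m %d %H:%M:%S")


# Operation histories 210 to 240 for the network boot of "client1".
HISTORY = [
    OperationRecord("2007 01 06", "12:00:01", "client1", "dhcpd", "start"),
    OperationRecord("2007 01 06", "12:00:03", "client1", "dhcpd", "end"),
    OperationRecord("2007 01 06", "12:00:04", "client1", "tftpd", "start"),
    OperationRecord("2007 01 06", "12:00:24", "client1", "tftpd", "end"),
]
```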

As described above, the service providing unit 112 provides services to the client computers and stores its operation history in real time in the service providing unit operation history table 113. As described later, the contents of this table are used to determine the processing time that the service providing unit 112 takes to provide a service to a client computer.

Next, the service providing unit status monitoring unit 121 is described.

The service providing unit status monitoring unit 121 executes the following processing at a predetermined interval. It refers to the operation history stored in the service providing unit operation history table 113, calculates the elapsed processing time of the service being provided by the service providing unit 112, and stores that time in the service providing unit state table 123. It also obtains the alive/dead state of the service providing unit 112, for example from the OS (Operating System), and stores it in the service providing unit state table 123. It then searches the failover instruction rule table 124 for the instruction rule that corresponds to the processing status (alive/dead state and elapsed processing time) of the service providing unit 112 stored in the service providing unit state table 123. If the condition defined in the selected instruction rule is satisfied, it instructs the cluster processing unit 111 to execute failover.
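One monitoring cycle can be sketched as "find the rule that matches the current state, then check its condition". The code below is only an illustration of that control flow; the types `ServiceState` and `Rule`, the use of callables for matching and conditions, and the function name `monitoring_cycle` are all assumptions rather than the structure of the actual tables.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class ServiceState:
    alive: bool            # alive/dead state obtained from the OS
    elapsed_seconds: int   # elapsed processing time from the history

States = Dict[str, ServiceState]

@dataclass
class Rule:
    matches: Callable[[States], bool]    # does this rule apply to the state?
    condition: Callable[[States], bool]  # should failover be instructed?


def monitoring_cycle(states: States, rules: List[Rule],
                     instruct_failover: Callable[[], None]) -> bool:
    """Run one cycle; return True if failover was instructed."""
    for rule in rules:
        if rule.matches(states):
            # Only the first matching rule is evaluated (an assumption).
            if rule.condition(states):
                instruct_failover()
                return True
            return False
    return False
```

For instance, a rule could match while "tftpd" is alive and fire once its elapsed time exceeds a configured limit; the caller supplies `instruct_failover`, which would stand in for the request to the cluster processing unit 111.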

The processing of the service providing unit status monitoring unit 121 is described concretely below, with reference to the tables it uses. The service providing unit status monitoring unit 121 runs on the active server computer 102, on which the service providing unit 112 is running.

FIG. 3 shows the structure of the service providing unit characteristic table 122. The user sets the operating characteristics of the service providing unit 112 in this table in advance. The table stores characteristic information in which a service providing unit name 301, an average processing time 303, and a limit processing time 304 are associated with one another. The service providing unit name 301 stores the name identifying the service providing unit 112. In this embodiment, it stores the DHCP server program name "dhcpd" and the TFTP server program name "tftpd".

The average processing time 303 stores the average time (in seconds) that the service providing unit 112 needs to provide a service to a client computer. For example, the processing time derived from the operation history in the service providing unit operation history table 113 shown in FIG. 2 is assumed to be the average, and that value is stored. That is, from the difference between the times 202 of the operation histories 210 and 220, the time that the DHCP server "dhcpd" needs to provide a service to a client computer is "2". Likewise, from the difference between the times 202 of the operation histories 230 and 240, the time that the TFTP server "tftpd" needs to provide a service to a client computer is "20".
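The time differences above can be reproduced by pairing each "start" row with the following "end" row for the same host and service. This is a minimal sketch; the tuple layout and the function name `processing_times` are assumptions, while the row values are those of the operation histories 210 to 240.

```python
from datetime import datetime

# (date 201, time 202, host name 203, service name 204, content 205)
HISTORY = [
    ("2007 01 06", "12:00:01", "client1", "dhcpd", "start"),
    ("2007 01 06", "12:00:03", "client1", "dhcpd", "end"),
    ("2007 01 06", "12:00:04", "client1", "tftpd", "start"),
    ("2007 01 06", "12:00:24", "client1", "tftpd", "end"),
]


def processing_times(history):
    """Return {service: [seconds, ...]} for each completed start/end pair."""
    started = {}    # (host, service) -> start timestamp
    durations = {}  # service -> list of processing times in seconds
    for date, time, host, service, content in history:
        stamp = datetime.strptime(f"{date} {time}", "%Y %m %d %H:%M:%S")
        key = (host, service)
        if content == "start":
            started[key] = stamp
        elif content == "end" and key in started:
            seconds = int((stamp - started.pop(key)).total_seconds())
            durations.setdefault(service, []).append(seconds)
    return durations
```

With more history rows, averaging each list would give a value for the average processing time 303; with the single boot shown here, each list holds one value.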

The limit processing time 304 stores the processing time (in seconds) that is the limit of the performance required of the service providing unit 112 when its responsiveness degrades because of some abnormality. Here, for example, the limit processing time is set to three times the average processing time 303, so the limit processing time 304 stores "6" and "60" respectively. If the processing time of the service providing unit 112 grows in proportion to the total number N of simultaneous accesses from the client computers 104 to 106, that is, becomes N times as long, the limit processing time 304 can instead store a proportional expression such as "6×N" or "60×N". The total number of simultaneous accesses here means the number of accesses within the average processing time counted from the start of a given process. In the following examples, the processing time is assumed to be independent of the total number N of simultaneous accesses.
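The arithmetic for the limit processing time is small enough to state directly. The sketch below assumes the factor of three used in the example and treats the proportional case as a simple multiplication by N; the function name and parameter names are hypothetical.

```python
def limit_processing_time(average_seconds: int, factor: int = 3,
                          simultaneous_accesses: int = 1) -> int:
    """Limit = average x factor, scaled by N when time grows with load.

    With simultaneous_accesses left at 1, the result is independent of N,
    which is the case assumed in the examples of the text.
    """
    return average_seconds * factor * simultaneous_accesses
```

For the averages 2 and 20 this gives 6 and 60; with N = 3 simultaneous accesses the proportional form gives 18 and 180.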

FIG. 4 shows the structure of the service providing unit state table 123. The table is updated periodically by the service providing unit status monitoring unit 121 and holds the latest alive/dead state and elapsed processing time of the service providing unit 112. It stores state information in which a service providing unit name 401, an alive/dead state 403, and an elapsed processing time 404 are associated with one another. The service providing unit name 401 stores the name identifying the service providing unit 112. In this embodiment, it stores the program names of the monitored servers: "dhcpd" for the DHCP server and "tftpd" for the TFTP server.

The alive/dead state 403 stores whether the service providing unit 112 is running or stopped: "alive" when running and "dead" when stopped. The service providing unit status monitoring unit 121 obtains the alive/dead state, for example, by querying the OS.

The elapsed processing time 404 stores the time (in seconds) that has elapsed since the service providing unit 112 started the process of providing a service to a client computer.

Specifically, the service providing unit status monitoring unit 121 refers to the service providing unit operation history table 113 shown in FIG. 2 and obtains the elapsed processing time 404 from the difference between the date 201 and time 202 of the referenced history information and the current date and time. For example, suppose that the table is referenced at time "12:00:14" on date "2007 01 06" (January 6, 2007), and that the history information 240 has not yet been stored. In this case, the content 205 of the history information 220 for the DHCP server "dhcpd" is "end" and its time "12:00:03" is earlier than the current time, so the elapsed processing time of the DHCP server "dhcpd" is "0". The content 205 of the history information 230 for the TFTP server "tftpd", on the other hand, is "start", so its elapsed processing time is "10", the difference between the current time "12:00:14" and the time "12:00:04". The elapsed processing time of each service providing unit 112 is calculated in this way and stored in the elapsed processing time 404.
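The worked example above can be checked with a few lines of code. This sketch assumes the history rows are available as tuples in chronological order and that only the latest row per service matters; the function name `elapsed_seconds` and the tuple layout are hypothetical.

```python
from datetime import datetime

# History as of 12:00:14; history information 240 is not yet stored.
HISTORY = [
    ("2007 01 06", "12:00:01", "client1", "dhcpd", "start"),
    ("2007 01 06", "12:00:03", "client1", "dhcpd", "end"),
    ("2007 01 06", "12:00:04", "client1", "tftpd", "start"),
]


def elapsed_seconds(history, service, now):
    """Seconds since the latest "start" of the service, or 0 if it has ended."""
    latest = None  # (timestamp, content) of the newest row for this service
    for date, time, _host, name, content in history:
        if name == service:
            stamp = datetime.strptime(f"{date} {time}", "%Y %m %d %H:%M:%S")
            latest = (stamp, content)
    if latest is None or latest[1] != "start":
        return 0  # no request in progress
    return int((now - latest[0]).total_seconds())
```

Calling it with `now = datetime(2007, 1, 6, 12, 0, 14)` gives 0 for "dhcpd" and 10 for "tftpd", matching the values in the text.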

FIG. 5 shows the configuration of the failover instruction rule table 124. In this table the user sets rules that specify the failover behavior for each processing status of the service providing units 112. Because the user can set these rules freely, failover can be controlled at a fine granularity. The failover instruction rule table 124 stores instruction rule information in which a life/death state 501 for each service providing unit 112, a processing elapsed time 502 for each service providing unit 112, and a rule 503 are associated with one another.

The life/death state 501 stores a life/death state for each monitored service providing unit 112 registered under the service providing unit name 401 in the service providing unit state table 123. Likewise, the processing elapsed time 502 stores a processing elapsed time (in seconds) for each monitored service providing unit 112. These values (life/death state and processing elapsed time) serve as reference values or thresholds against which the processing status of the service providing units 112 stored in the service providing unit state table 123 is compared. In this embodiment, the life/death state 501 stores the life/death states 504 and 505 of the monitored DHCP server "dhcpd" and TFTP server "tftpd", respectively, and the processing elapsed time 502 stores their processing elapsed times 506 and 507, respectively. The user sets arbitrary combinations of the life/death state 501 and the processing elapsed time 502, together with the rule 503 associated with each combination.

The rule 503 stores a condition 508 and an execution content 509 in association with each other. As described above, a rule 503 exists for each combination of the life/death states 504 and 505 and the processing elapsed times 506 and 507 of the service providing units 112. The condition 508 stores the condition under which the execution content 509 is executed.

For example, when the condition 508 is "none", the service providing unit status monitoring unit 121 executes the execution content 509 unconditionally. When the condition 508 is "immediately after the service providing unit completes processing", the service providing unit status monitoring unit 121 executes the execution content 509 when it detects that the service providing unit operation history table 113 contains, for the specified service providing unit, history information (content 205 is "end") that corresponds to the history information indicating that service provision started (content 205 is "start"). When the condition 508 is "after the limit processing time of the service providing unit", the service providing unit status monitoring unit 121 executes the execution content 509 when it detects that the processing elapsed time 404 of the specified service providing unit in the service providing unit state table 123 exceeds the limit processing time 304 in the service providing unit characteristic table 122. These conditions may also be combined with AND or OR.
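One possible way to evaluate the condition 508 is sketched below. The class `ServiceStatus`, the trigger names `completed` and `limit_exceeded`, and the tuple encoding of a condition are assumptions made for illustration; the specification does not prescribe a data format. Only the OR combination is modeled here, and AND combinations are left out for brevity.

```python
from dataclasses import dataclass

@dataclass
class ServiceStatus:
    completed: bool  # an "end" record matches the latest "start" record
    elapsed: int     # processing elapsed time 404, seconds
    limit: int       # limit processing time 304, seconds

def condition_holds(condition, statuses):
    """Evaluate condition 508. None stands for "none" (unconditional)."""
    if condition is None:
        return True
    service, triggers = condition
    status = statuses[service]
    if "completed" in triggers and status.completed:
        return True
    if "limit_exceeded" in triggers and status.elapsed > status.limit:
        return True
    return False

# "immediately after tftpd completes processing OR after its 60 s limit"
WAIT_FOR_TFTPD = ("tftpd", {"completed", "limit_exceeded"})

busy = {"tftpd": ServiceStatus(completed=False, elapsed=30, limit=60)}
done = {"tftpd": ServiceStatus(completed=True, elapsed=0, limit=60)}
print(condition_holds(WAIT_FOR_TFTPD, busy))  # False
print(condition_holds(WAIT_FOR_TFTPD, done))  # True
```

While tftpd is still within its limit and has not finished, the condition does not hold, so the execution content is deferred to a later monitoring cycle.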

The execution content 509 stores the processing that the service providing unit status monitoring unit 121 is to execute. For example, when the execution content 509 is "normal processing", the service providing unit status monitoring unit 121 does nothing. When the execution content 509 is "failover execution", the service providing unit status monitoring unit 121 instructs the cluster processing unit 111 to execute failover.

Using the failover instruction rule table 124 described above, the service providing unit status monitoring unit 121 periodically performs processing according to the instruction rules. First, it acquires the life/death state 403 and the processing elapsed time 404 stored in the service providing unit state table 123 for each service providing unit name 401. Next, from the instruction rule information 511 to 541, it selects the instruction rule information whose life/death state 501 and processing elapsed time 502 match the acquired life/death state and processing elapsed time. It then determines whether the condition 508 of the selected instruction rule information is satisfied. If the condition is satisfied, it executes the execution content 509; if not, it does not. The individual items of instruction rule information and the corresponding operation of the service providing unit status monitoring unit 121 are described below.

The instruction rule information 511 covers the case where the life/death state 403 of both the DHCP server "dhcpd" and the TFTP server "tftpd" is "alive" (running) and their processing elapsed times 404 are no greater than the limit processing times of "6" seconds and "60" seconds, respectively. In this case the condition 508 is "none" and the execution content 509 is "normal processing", so the service providing unit status monitoring unit 121 performs no particular processing.

The instruction rule information 521 covers the case where the life/death state 403 of the DHCP server "dhcpd" is "dead" (stopped) and the processing elapsed time 404 of the TFTP server "tftpd" is "0" seconds (waiting to accept a service request). The instruction rule information 531 covers the case where the life/death state 403 of the TFTP server "tftpd" is "dead" and the processing elapsed time 404 of the DHCP server "dhcpd" is "0" seconds. The instruction rule information 541 covers the case where the life/death state 403 of both the DHCP server "dhcpd" and the TFTP server "tftpd" is "dead". In these cases the condition 508 is "none" and the execution content 509 is "failover execution", so the service providing unit status monitoring unit 121 issues a failover execution instruction to the cluster processing unit 111.

The instruction rule information 512 covers the case where the life/death state 403 of both the DHCP server "dhcpd" and the TFTP server "tftpd" is "alive", the processing elapsed time 404 of the DHCP server "dhcpd" exceeds its limit processing time of "6" seconds, and the processing elapsed time 404 of the TFTP server "tftpd" is "0" seconds. The instruction rule information 514 covers the case where the life/death state 403 of both servers is "alive", the processing elapsed time 404 of the TFTP server "tftpd" exceeds its limit processing time of "60" seconds, and the processing elapsed time 404 of the DHCP server "dhcpd" is "0" seconds. In these cases the condition 508 is "none" and the execution content 509 is "failover execution", so the service providing unit status monitoring unit 121 issues a failover execution instruction to the cluster processing unit 111. With instruction rule information of this kind, failover is executed when the processing time of at least one of the service providing units 112 exceeds its limit processing time. Thus, even if the responsiveness of a service providing unit 112 degrades, the effect of that degradation can be avoided by failing over.

The instruction rule information 513 and 515 refine the instruction rule information 512 and 514, respectively, with more detailed conditions.

The instruction rule information 513 covers the case where the life/death state 403 of both the DHCP server "dhcpd" and the TFTP server "tftpd" is "alive", the processing elapsed time 404 of the DHCP server "dhcpd" exceeds its limit processing time of "6" seconds, and the processing elapsed time 404 of the TFTP server "tftpd" is greater than "0" seconds (it is providing a service to a client computer). The condition 508 is "immediately after tftpd completes processing, or after the tftpd limit processing time of 60 s", and the execution content 509 is "failover execution". In this case, the service providing unit status monitoring unit 121 issues a failover execution instruction to the cluster processing unit 111 when it detects that the service providing unit operation history table 113 contains, for the TFTP server "tftpd", history information (content 205 is "end") corresponding to the history information indicating that service provision started (content 205 is "start"), or when it detects that the processing elapsed time 404 exceeds the limit processing time of "60" seconds. If this condition 508 is not satisfied, it does not issue a failover execution instruction.

The instruction rule information 515 covers the case where the life/death state 403 of both the DHCP server "dhcpd" and the TFTP server "tftpd" is "alive", the processing elapsed time 404 of the DHCP server "dhcpd" is greater than "0" seconds, and the processing elapsed time 404 of the TFTP server "tftpd" exceeds its limit processing time of "60" seconds. The condition 508 is "immediately after dhcpd completes processing, or after the dhcpd limit processing time of 6 s", and the execution content 509 is "failover execution". In this case, the service providing unit status monitoring unit 121 issues a failover execution instruction to the cluster processing unit 111 when it detects that the service providing unit operation history table 113 contains, for the DHCP server "dhcpd", history information (content 205 is "end") corresponding to the history information indicating that service provision started (content 205 is "start"), or when it detects that the processing elapsed time 404 exceeds the limit processing time of "6" seconds. If this condition 508 is not satisfied, it does not issue a failover execution instruction.

With instruction rule information such as 513 and 515, when the processing time of at least one service providing unit 112 exceeds its limit processing time while another service providing unit 112 that is not stopped is still processing, failover is executed after the service providing unit 112 that is processing has completed its processing. Service provision from the service providing units 112 to the client computers is therefore not interrupted. To have failover executed unconditionally instead, the condition 508 of the instruction rule information 513 and 515 is set to "none".

The instruction rule information 522 and 532 refine the instruction rule information 521 and 531, respectively, with more detailed conditions.

The instruction rule information 522 covers the case where the life/death state 403 of the DHCP server "dhcpd" is "dead" and the processing elapsed time 404 of the TFTP server "tftpd" is greater than "0" seconds (it is providing a service to a client computer). As in the instruction rule information 513, the condition 508 is "immediately after tftpd completes processing, or after the tftpd limit processing time of 60 s", and the execution content 509 is "failover execution". In this case, the service providing unit status monitoring unit 121 issues a failover execution instruction to the cluster processing unit 111 when it detects that the service providing unit operation history table 113 contains, for the TFTP server "tftpd", history information (content 205 is "end") corresponding to the history information indicating that service provision started (content 205 is "start"), or when it detects that the processing elapsed time 404 exceeds the limit processing time of "60" seconds. If this condition 508 is not satisfied, it does not issue a failover execution instruction.

The instruction rule information 532 covers the case where the life/death state 403 of the TFTP server "tftpd" is "dead" and the processing elapsed time 404 of the DHCP server "dhcpd" is greater than "0" seconds (it is providing a service to a client computer). As in the instruction rule information 515, the condition 508 is "immediately after dhcpd completes processing, or after the dhcpd limit processing time of 6 s", and the execution content 509 is "failover execution". In this case, the service providing unit status monitoring unit 121 issues a failover execution instruction to the cluster processing unit 111 when it detects that the service providing unit operation history table 113 contains, for the DHCP server "dhcpd", history information (content 205 is "end") corresponding to the history information indicating that service provision started (content 205 is "start"), or when it detects that the processing elapsed time 404 exceeds the limit processing time of "6" seconds. If this condition 508 is not satisfied, it does not issue a failover execution instruction.

With instruction rule information such as 522 and 532, even when at least one of the service providing units 112 has stopped, if another service providing unit 112 that is not stopped is still processing, failover is executed after that service providing unit 112 has completed its processing. Service provision from the service providing units 112 to the client computers is therefore not interrupted. To have failover executed unconditionally instead, the condition 508 of the instruction rule information 522 and 532 is set to "none".

Setting the instruction rules as described above makes it possible, in a network boot system, to avoid a situation where client computers cannot network boot even when the active server computer, the DHCP server, or the TFTP server goes down. Likewise, even when the responsiveness of the DHCP server or the TFTP server degrades, it is possible to avoid a situation where a client computer fails to start because, for example, its network boot times out.
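The instruction rule information 511 to 541 described above can be written down as a small lookup table. The following Python sketch is one hypothetical encoding, not the format of FIG. 5: the names `RULES` and `select_rule`, the use of predicates, and the strings describing the condition 508 are assumptions. Two further assumptions are marked in the comments: rules 521, 522, 531, and 532 are taken to require that the other server is alive (so that they do not shadow rule 541), and when more than one rule could match, for instance when both servers exceed their limits, the first rule in the list is used.

```python
DHCPD_LIMIT = 6    # limit processing time of dhcpd, seconds
TFTPD_LIMIT = 60   # limit processing time of tftpd, seconds

WAIT_TFTPD = "tftpd completes or exceeds 60 s"
WAIT_DHCPD = "dhcpd completes or exceeds 6 s"

# (rule id, match predicate, condition 508, execution content 509)
# d / t: True when dhcpd / tftpd is "alive"; ed / et: their elapsed seconds.
RULES = [
    (511, lambda d, t, ed, et: d and t and ed <= DHCPD_LIMIT and et <= TFTPD_LIMIT, None, "normal"),
    (512, lambda d, t, ed, et: d and t and ed > DHCPD_LIMIT and et == 0, None, "failover"),
    (513, lambda d, t, ed, et: d and t and ed > DHCPD_LIMIT and et > 0, WAIT_TFTPD, "failover"),
    (514, lambda d, t, ed, et: d and t and et > TFTPD_LIMIT and ed == 0, None, "failover"),
    (515, lambda d, t, ed, et: d and t and et > TFTPD_LIMIT and ed > 0, WAIT_DHCPD, "failover"),
    # Assumption: the surviving server must be alive for rules 521 to 532.
    (521, lambda d, t, ed, et: not d and t and et == 0, None, "failover"),
    (522, lambda d, t, ed, et: not d and t and et > 0, WAIT_TFTPD, "failover"),
    (531, lambda d, t, ed, et: d and not t and ed == 0, None, "failover"),
    (532, lambda d, t, ed, et: d and not t and ed > 0, WAIT_DHCPD, "failover"),
    (541, lambda d, t, ed, et: not d and not t, None, "failover"),
]

def select_rule(d_alive, t_alive, d_elapsed, t_elapsed):
    """Return (rule id, condition 508, execution content 509) of the first match."""
    for rule_id, matches, condition, action in RULES:
        if matches(d_alive, t_alive, d_elapsed, t_elapsed):
            return rule_id, condition, action
    return None

print(select_rule(True, True, 0, 10))   # (511, None, 'normal')
print(select_rule(True, True, 7, 10))   # (513, 'tftpd completes or exceeds 60 s', 'failover')
print(select_rule(False, True, 0, 0))   # (521, None, 'failover')
```

Keeping the match predicate separate from the condition 508 mirrors the two-stage evaluation in the text: a rule is first selected by state, and only then is its condition checked.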

Next, the operation of the service providing unit status monitoring unit 121 configured as described above will be described.

FIG. 6 is a flowchart showing the processing of the service providing unit status monitoring unit 121. The service providing unit status monitoring unit 121 executes the following flow periodically, at an interval specified by the user. For example, because the service providing units for network boot (the DHCP server and the TFTP server) perform processing on the order of seconds, an interval of about one second is preferable.

First, the service providing unit status monitoring unit 121 reads the operation history of the service providing units 112 stored in the service providing unit operation history table 113 (S601).

Next, the service providing unit status monitoring unit 121 updates the service providing unit state table 123 (S602). Specifically, it queries the OS for the operating state of each service providing unit 112 ("alive" if running, "dead" if stopped) and stores the result in the life/death state 403. It also checks the read operation history in order and classifies it by host name 203 and by service providing unit name 204. It then refers to the content 205 of the operation history for each service providing unit. If the history contains an "end" record corresponding to a "start" record, it stores "0" seconds in the processing elapsed time 404. If only the "start" record exists, it stores in the processing elapsed time 404 the time obtained from the difference between the date 201 and time 202 of that record and the current date and time.

Next, the service providing unit status monitoring unit 121 determines whether the life/death state 403 of each service providing unit 112 in the service providing unit state table 123 updated in S602 matches the life/death state 501 of each item of instruction rule information in the failover instruction rule table 124, and selects the matching instruction rule information (S603).

The service providing unit status monitoring unit 121 then determines whether the processing elapsed time 404 of each service providing unit 112 in the service providing unit state table 123 updated in S602 matches the processing elapsed time 502 of the instruction rule information selected in S603, and selects the matching instruction rule information (S604). Through S603 and S604, one of the instruction rule information 511 to 541 is selected.

Next, the service providing unit status monitoring unit 121 determines whether the state of the service providing units 112 satisfies the condition 508 of the instruction rule information selected in S604 (S605). Specifically, when the condition 508 is "none", processing proceeds to S606 ("YES" in S605). When the condition 508 is anything other than "none", the service providing unit status monitoring unit 121 determines whether the specified service providing unit has completed its processing or whether its processing time exceeds the limit processing time. If it detects that processing has just completed or that the limit processing time has elapsed, the condition 508 is satisfied and processing proceeds to S606 ("YES" in S605). If the condition 508 is not satisfied, processing ends ("NO" in S605, END).

In S606, the service providing unit status monitoring unit 121 executes the execution content 509 of the selected instruction rule information. Specifically, when the execution content 509 is "normal processing", the service providing unit status monitoring unit 121 does nothing. When the execution content 509 is "failover execution", the service providing unit status monitoring unit 121 instructs the cluster processing unit 111 to execute failover. On receiving the instruction, the cluster processing unit 111 hands resources over from the active server computer 102 to the standby server computer 102 and swaps the standby and active roles.

After executing the above steps, the service providing unit status monitoring unit 121 ends the processing. It then repeats the same flow periodically.
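The flow of FIG. 6 (S601 to S606) can be summarized as a single function that is called once per monitoring interval. The sketch below is an illustration only: the function name `run_cycle`, its callback parameters, and the dictionary form of a rule are hypothetical, and the actual steps are passed in as callables so that the control flow stays visible.

```python
def run_cycle(read_history, update_state, select_rule, condition_holds, request_failover):
    """Run one monitoring cycle and return what was done."""
    history = read_history()                      # S601: read operation history
    state = update_state(history)                 # S602: refresh state table 123
    rule = select_rule(state)                     # S603 and S604: pick one rule
    if rule is None or not condition_holds(rule, state, history):
        return "end"                              # S605: NO, nothing to execute
    if rule["action"] == "failover":              # S606: execute content 509
        request_failover()                        # instruct cluster processing unit 111
    return rule["action"]

# Usage with stub callbacks: dhcpd is dead, so the selected rule is unconditional.
calls = []
result = run_cycle(
    read_history=lambda: [],
    update_state=lambda history: {"dhcpd": ("dead", 0), "tftpd": ("alive", 0)},
    select_rule=lambda state: {"condition": None, "action": "failover"},
    condition_holds=lambda rule, state, history: rule["condition"] is None,
    request_failover=lambda: calls.append("failover requested"),
)
print(result, calls)  # failover ['failover requested']
```

A scheduler that calls `run_cycle` at the user-specified interval (about one second in the network boot example) would complete the picture; it is omitted here.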

The first embodiment has been described above. According to the first embodiment, in a cluster system, failover to the standby server computer is executed when at least one of the server programs on the active server computer has not finished processing within the limit processing time set as its performance requirement. Because failover to the standby server computer is executed even when a server program comes under heavy load, the system can cope not only with hardware failures of the server computer but also with degraded responsiveness of a server program.

In addition, when at least one of the server programs on the active server computer has not finished processing within the limit processing time, failover to the standby server computer is executed after waiting for the other server programs to complete their processing. This avoids interrupting processing between a server program and a client computer at failover. A highly reliable and highly responsive cluster system that keeps services running without interruption can thus be provided.

Next, a modification of the first embodiment will be described. This modification simplifies part of the configuration of the first embodiment. It is described below with reference to FIGS. 8, 9, and 10, focusing on the differences from the first embodiment. As in the first embodiment, a configuration that applies the present invention to a network boot system is used as the example. The system configuration and the functional configuration of the server computer (FIG. 1), the service providing unit operation history table 113 (FIG. 2), the service providing unit characteristic table 122 (FIG. 3), and the hardware configuration of the server computer (FIG. 7) are the same as in the first embodiment, so their description is omitted.

Unlike the first embodiment, this modification has the service providing unit state table 123 shown in FIG. 8 and the failover instruction rule table 124 shown in FIG. 9. Accordingly, the processing of the service providing unit status monitoring unit 121, which uses these tables, also differs.

FIG. 8 shows the configuration of the service providing unit state table 123 of the modification. It differs from the first embodiment (FIG. 4) in that the life/death state 403 is omitted. The service providing unit state table 123 is updated periodically by the service providing unit status monitoring unit 121 and holds the latest processing elapsed time of each service providing unit 112. The table stores state information in which the service providing unit name 401 and the processing elapsed time 404 are associated with each other. The content of the processing elapsed time 404 is the same as in the first embodiment.

FIG. 9 shows the configuration of the failover instruction rule table 124 of the modification. It differs from the first embodiment (FIG. 5) in that the life/death state 501 for each service providing unit 112 is omitted. The failover instruction rule table 124 stores, as set by the user, instruction rule information in which the processing elapsed time 502 for each service providing unit 112 and the rule 503 are associated with each other. The contents of the processing elapsed time 502 and the rule 503 are the same as in the first embodiment.

Using the failover instruction rule table 124 described above, the service providing unit status monitoring unit 121 periodically performs processing according to the instruction rules. First, it acquires the processing elapsed time 404 stored in the service providing unit state table 123 for each service providing unit name 401. Next, from the instruction rule information 911 to 915, it selects the instruction rule information whose processing elapsed time 502 matches the acquired processing elapsed time. It then determines whether the condition 508 of the selected instruction rule information is satisfied. If the condition is satisfied, it executes the execution content 509; if not, it does not. The individual items of instruction rule information and the corresponding operation of the service providing unit status monitoring unit 121 are described below.

The instruction rule information 911 covers the case where the processing elapsed times 404 of the DHCP server "dhcpd" and the TFTP server "tftpd" are no greater than the limit processing times of "6" seconds and "60" seconds, respectively. In this case the condition 508 is "none" and the execution content 509 is "normal processing", so the service providing unit status monitoring unit 121 performs no particular processing.

The instruction rule information 912 covers the case where the processing elapsed time 404 of the DHCP server "dhcpd" exceeds its limit processing time of "6" seconds and the processing elapsed time 404 of the TFTP server "tftpd" is "0" seconds. The instruction rule information 914 covers the case where the processing elapsed time 404 of the TFTP server "tftpd" exceeds its limit processing time of "60" seconds and the processing elapsed time 404 of the DHCP server "dhcpd" is "0" seconds. In these cases the condition 508 is "none" and the execution content 509 is "failover execution", so the service providing unit status monitoring unit 121 issues a failover execution instruction to the cluster processing unit 111. With instruction rule information of this kind, failover is executed when the processing time of at least one of the service providing units 112 exceeds its limit processing time. Thus, even if the responsiveness of a service providing unit 112 degrades, the effect of that degradation can be avoided by failing over.

The instruction rule information 913 applies when the processing elapsed time 404 of the DHCP server "dhcpd" exceeds the limit processing time of "6" seconds and the processing elapsed time 404 of the TFTP server "tftpd" is greater than "0" seconds (that is, the TFTP server is currently providing a service to a client computer). The condition 508 is "immediately after tftpd completes processing, or after the tftpd limit processing time of 60 s", and the execution content 509 is "failover execution". In this case, the service providing unit status monitoring unit 121 issues a failover execution instruction to the cluster processing unit 111 when it detects either of the following: the service providing unit operation history table 113 contains history information (content 205 is "end") corresponding to the history information indicating that the TFTP server "tftpd" started providing the service (content 205 is "start"), or the processing elapsed time 404 exceeds the limit processing time of "60" seconds. If the condition 508 is not satisfied, no failover execution instruction is issued.

The instruction rule information 915 applies when the processing elapsed time 404 of the DHCP server "dhcpd" is greater than "0" seconds and the processing elapsed time 404 of the TFTP server "tftpd" exceeds the limit processing time of "60" seconds. The condition 508 is "immediately after dhcpd completes processing, or after the dhcpd limit processing time of 6 s", and the execution content 509 is "failover execution". In this case, the service providing unit status monitoring unit 121 issues a failover execution instruction to the cluster processing unit 111 when it detects either of the following: the service providing unit operation history table 113 contains history information (content 205 is "end") corresponding to the history information indicating that the DHCP server "dhcpd" started providing the service (content 205 is "start"), or the processing elapsed time 404 exceeds the limit processing time of "6" seconds. If the condition 508 is not satisfied, no failover execution instruction is issued.

With instruction rule information such as 913 and 915, when the processing time of at least one of the service providing units 112 exceeds its limit processing time while another service providing unit 112 that is not stopped is still processing, failover is executed only after the service providing unit 112 that is still processing has completed its processing. Service provision from that service providing unit 112 to the client computer is therefore not interrupted.
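The selection of instruction rule information 911 to 915 and the check of the condition 508 can be pictured as a small table-driven evaluation. The following Python sketch is only an illustration under stated assumptions: the `Rule` class, the `evaluate` function, and the return strings are hypothetical names that do not appear in the specification, and only the limit processing times (6 s and 60 s) and the rule numbers come from the text. It also assumes that the "immediately after processing completes" branch of the condition 508 is observed on a later periodic check, when the elapsed time of the completed service has returned to 0.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Elapsed = Dict[str, float]  # service name -> processing elapsed time 404 in seconds

# Limit processing times given in the text.
LIMIT = {"dhcpd": 6, "tftpd": 60}


def over(e: Elapsed, service: str) -> bool:
    """True if the service has exceeded its limit processing time."""
    return e[service] > LIMIT[service]


def always(e: Elapsed) -> bool:
    """Condition 508 of "none": nothing further has to hold."""
    return True


@dataclass
class Rule:  # hypothetical structure for one piece of instruction rule information
    rule_id: int
    matches: Callable[[Elapsed], bool]    # pattern on processing elapsed time 502
    condition: Callable[[Elapsed], bool]  # condition 508
    action: str                           # execution content 509


RULES = [
    Rule(911, lambda e: not over(e, "dhcpd") and not over(e, "tftpd"), always, "normal"),
    Rule(912, lambda e: over(e, "dhcpd") and e["tftpd"] == 0, always, "failover"),
    # 913: dhcpd is over its limit while tftpd is busy; wait for tftpd's own limit.
    Rule(913, lambda e: over(e, "dhcpd") and e["tftpd"] > 0,
         lambda e: over(e, "tftpd"), "failover"),
    Rule(914, lambda e: over(e, "tftpd") and e["dhcpd"] == 0, always, "failover"),
    # 915: tftpd is over its limit while dhcpd is busy; wait for dhcpd's own limit.
    Rule(915, lambda e: e["dhcpd"] > 0 and over(e, "tftpd"),
         lambda e: over(e, "dhcpd"), "failover"),
]


def evaluate(e: Elapsed) -> str:
    """Select the first matching rule and apply it only if its condition holds."""
    for rule in RULES:
        if rule.matches(e):
            return rule.action if rule.condition(e) else "no instruction"
    return "no instruction"
```

In this sketch, `evaluate({"dhcpd": 7, "tftpd": 30})` selects rule 913 and issues no instruction yet, because tftpd is still serving a client within its limit. Once tftpd finishes and its elapsed time is recorded as 0, the same call selects rule 912 and returns a failover instruction.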

By setting the instruction rules as described above, even if the responsiveness of the DHCP server or the TFTP server degrades in a network boot system, it is possible to avoid a situation in which a client computer cannot start up because, for example, the network boot startup time times out.

FIG. 10 is a flowchart showing the processing flow of the service providing unit status monitoring unit 121 according to the modification. It differs from the first embodiment (FIG. 6) in that S603 is omitted, and in that processing (S602', S604') adapted to the service providing unit state table 123 (FIG. 8) and the failover instruction rule table 124 (FIG. 9) is executed. S601, S605, and S606 are the same as in the first embodiment, so their description is omitted.

After S601, the service providing unit status monitoring unit 121 updates the service providing unit state table 123 (S602'). Specifically, it checks the read operation history entries in order and classifies them by host name 203 and by service providing unit name 204. It then refers to the content 205 of the operation history for each service providing unit. If the operation history contains an "end" entry corresponding to a "start" entry, it stores "0" seconds in the processing elapsed time 404. If only a "start" entry is recorded, it stores in the processing elapsed time 404 the time obtained as the difference between the date 201 and time 202 of that operation history entry and the current date and time.
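The update in S602' can be sketched as follows. This is a minimal illustration, not the specification's implementation: the function name `elapsed_times` and the tuple layout are hypothetical, the date 201 and time 202 are assumed to be already combined into a single `datetime` value, and the history is assumed to be in chronological order.

```python
from datetime import datetime
from typing import Dict, Iterable, Optional, Tuple

# One operation history entry: (date and time, host name 203,
# service providing unit name 204, content 205).
Entry = Tuple[datetime, str, str, str]


def elapsed_times(history: Iterable[Entry], now: datetime) -> Dict[Tuple[str, str], float]:
    """Return the processing elapsed time in seconds per (host, service)."""
    open_start: Dict[Tuple[str, str], Optional[datetime]] = {}
    for timestamp, host, service, content in history:
        key = (host, service)
        if content == "start":
            open_start[key] = timestamp  # processing began and is not yet ended
        elif content == "end":
            open_start[key] = None       # the matching "end" was recorded
    return {
        key: 0.0 if started is None else (now - started).total_seconds()
        for key, started in open_start.items()
    }
```

A service whose latest "start" has a matching "end" gets 0 seconds, while a service with only a "start" gets the time elapsed since that entry.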

Next, the service providing unit status monitoring unit 121 determines whether the processing elapsed time 404 of each service providing unit 112 in the service providing unit state table 123 updated in S602' matches the processing elapsed time 502 of each piece of instruction rule information in the failover instruction rule table 124, and selects the matching instruction rule information (S604'). The service providing unit status monitoring unit 121 then executes the processing of S605 and S606.

The modification of the first embodiment has been described above. According to this modification, in the cluster system, when at least one of the server programs on the active server computer has not finished processing within the limit processing time that is its performance requirement, failover to the standby server computer is executed. As a result, when a server program comes under high load, failover to the standby server computer is executed, so the drop in the responsiveness of the server program can be dealt with.

The present invention has been described above in connection with exemplary embodiments. It is evident that many alternatives, modifications, and variations will be apparent to those skilled in the art. Accordingly, the embodiments of the present invention described above are intended to illustrate the gist and scope of the present invention, not to limit them.

FIG. 1 is a block diagram showing the system configuration and the functional configuration of a server computer according to the first embodiment.
FIG. 2 is a diagram for explaining the configuration of the service providing unit operation history management table according to the first embodiment.
FIG. 3 is a diagram for explaining the configuration of the service providing unit characteristic table according to the first embodiment.
FIG. 4 is a diagram for explaining the configuration of the service providing unit state table according to the first embodiment.
FIG. 5 is a diagram for explaining the configuration of the failover instruction rule table according to the first embodiment.
FIG. 6 is a flowchart showing the processing flow of the service providing unit status monitoring unit according to the first embodiment.
FIG. 7 is a block diagram showing the hardware configuration of the server computer according to the first embodiment.
FIG. 8 is a diagram for explaining the configuration of the service providing unit state table according to the modification of the first embodiment.
FIG. 9 is a diagram for explaining the configuration of the failover instruction rule table according to the modification of the first embodiment.
FIG. 10 is a flowchart showing the processing flow of the service providing unit status monitoring unit according to the modification of the first embodiment.

Explanation of symbols

101 ... cluster system, 102 ... server computer, 104 ... client computer, 105 ... client computer, 106 ... client computer, 107 ... LAN, 108 ... LAN, 111 ... cluster processing unit, 112 ... service providing unit, 113 ... service providing unit operation history table, 122 ... service providing unit characteristic table, 123 ... service providing unit state table, 124 ... failover instruction rule table, 150 ... control unit, 152 ... storage unit,
201 ... date, 202 ... time, 203 ... host name, 204 ... service providing unit name, 205 ... content, 210 to 240 ... operation history,
301 ... service providing unit name, 303 ... average processing time, 304 ... limit processing time, 310 to 320 ... characteristic information,
401 ... service providing unit name, 403 ... life/death state, 404 ... processing elapsed time, 410 to 420 ... state information,
501 ... life/death state, 502 ... processing elapsed time, 503 ... rule, 504 to 505 ... life/death state of each service providing unit, 506 to 507 ... processing elapsed time of each service providing unit, 508 ... condition, 509 ... execution content, 511 to 541 ... instruction rule information,
11 ... CPU, 12 ... main storage device, 13 ... auxiliary storage device, 14 ... input device, 15 ... output device, 16 ... reading device, 17 ... communication device.

Claims

A server constituting a cluster system having a plurality of servers, each of which includes a failover processing unit that, upon receiving a failover execution instruction, performs failover processing in which provision of a service is taken over from one server providing the service by another server, the server comprising:
a storage unit that stores state information specifying a state of a processing status in which failover is to be performed;
a status monitoring unit that identifies the processing status when the service is provided; and
a failover execution instruction unit that instructs the failover processing unit to execute failover when the processing status identified by the status monitoring unit matches the state of the processing status specified by the state information.

The server according to claim 1, wherein
the storage unit stores, in association with each piece of state information, condition information specifying a condition for performing failover, and
the failover execution instruction unit instructs the failover processing unit to execute failover when the processing status identified by the status monitoring unit matches the state of the processing status specified by the state information and the processing status identified by the status monitoring unit satisfies the condition information corresponding to the state information determined to match.

The server according to claim 1, wherein
the processing status is a processing time of a predetermined process, and
the state information is information specifying that the processing time of the predetermined process has exceeded a predetermined limit time.

The server according to claim 2, wherein
the processing status is a processing time of a predetermined process,
the state information is information specifying that the processing time of the predetermined process has exceeded a predetermined limit time, and
the condition information is information specifying at least one of the following: that a process other than the predetermined process has completed, and that the processing time of a process other than the predetermined process has exceeded a predetermined limit time.

The server according to claim 2, wherein
the processing status is an operation state indicating whether a predetermined process is operating or stopped,
the state information is information specifying that the operation state of the predetermined process is stopped, and
the condition information is information specifying at least one of the following: that a process other than the predetermined process has completed, and that the processing time of a process other than the predetermined process has exceeded a predetermined limit time.

A failover execution method in a server constituting a cluster system having a plurality of servers, each of which includes a failover processing unit that, upon receiving a failover execution instruction, performs failover processing in which provision of a service is taken over from one server providing the service by another server, wherein
the server
stores state information specifying a state of a processing status in which failover is to be performed, and executes:
a status monitoring step of identifying the processing status when the service is provided; and
a failover execution instruction step of instructing the failover processing unit to execute failover when the processing status identified in the status monitoring step matches the state of the processing status specified by the state information.

The failover execution method according to claim 6, wherein
the server
stores, in association with each piece of state information, condition information specifying a condition for performing failover, and
in the failover execution instruction step,
the failover processing unit is instructed to execute failover when the processing status identified by the status monitoring unit matches the state of the processing status specified by the state information and the processing status identified by the status monitoring unit satisfies the condition information corresponding to the state information determined to match.

The failover execution method according to claim 6, wherein
the processing status is a processing time of a predetermined process, and
the state information is information specifying that the processing time of the predetermined process has exceeded a predetermined limit time.

The failover execution method according to claim 7, wherein
the processing status is a processing time of a predetermined process,
the state information is information specifying that the processing time of the predetermined process has exceeded a predetermined limit time, and
the condition information is information specifying at least one of the following: that a process other than the predetermined process has completed, and that the processing time of a process other than the predetermined process has exceeded a predetermined limit time.

The failover execution method according to claim 7, wherein
the processing status is an operation state indicating whether a predetermined process is operating or stopped,
the state information is information specifying that the operation state of the predetermined process is stopped, and
the condition information is information specifying at least one of the following: that a process other than the predetermined process has completed, and that the processing time of a process other than the predetermined process has exceeded a predetermined limit time.

A program for causing a computer to function as a server constituting a cluster system having a plurality of servers, each of which includes a failover processing unit that, upon receiving a failover execution instruction, performs failover processing in which provision of a service is taken over from one server providing the service by another server,
the program causing the computer to function as:
a storage unit that stores state information specifying a state of a processing status in which failover is to be performed;
a status monitoring unit that identifies the processing status when the service is provided; and
a failover execution instruction unit that instructs the failover processing unit to execute failover when the processing status identified by the status monitoring unit matches the state of the processing status specified by the state information.

The program according to claim 11, wherein
the storage unit stores, in association with each piece of state information, condition information specifying a condition for performing failover, and
the failover execution instruction unit instructs the failover processing unit to execute failover when the processing status identified by the status monitoring unit matches the state of the processing status specified by the state information and the processing status identified by the status monitoring unit satisfies the condition information corresponding to the state information determined to match.

The program according to claim 11, wherein
the processing status is a processing time of a predetermined process, and
the state information is information specifying that the processing time of the predetermined process has exceeded a predetermined limit time.

The program according to claim 12, wherein
the processing status is a processing time of a predetermined process,
the state information is information specifying that the processing time of the predetermined process has exceeded a predetermined limit time, and
the condition information is information specifying at least one of the following: that a process other than the predetermined process has completed, and that the processing time of a process other than the predetermined process has exceeded a predetermined limit time.

The program according to claim 12, wherein
the processing status is an operation state indicating whether a predetermined process is operating or stopped,
the state information is information specifying that the operation state of the predetermined process is stopped, and
the condition information is information specifying at least one of the following: that a process other than the predetermined process has completed, and that the processing time of a process other than the predetermined process has exceeded a predetermined limit time.