JP2003067266A

JP2003067266A - Server system and fault detection method for client server system

Info

Publication number: JP2003067266A
Application number: JP2001256526A
Authority: JP
Inventors: Takeshi Mishima; 健三島; Etsuo Masuda; 悦夫増田; Akito Fujita; 昭人藤田; Yoichi Hijikata; 陽一土方
Original assignee: NTT Advanced Technology Corp; Nippon Telegraph and Telephone Corp
Current assignee: NTT Advanced Technology Corp; Nippon Telegraph and Telephone Corp
Priority date: 2001-08-27
Filing date: 2001-08-27
Publication date: 2003-03-07

Abstract

PROBLEM TO BE SOLVED: To provide a highly fault-tolerant server system with a simple and inexpensive configuration, and a failure detection method for it. SOLUTION: A control part 30 transfers a processing request from a client 40 to servers 21, 22 mounted on an information processor 91 and to a server 23 mounted on an information processor 92, determines the validity of the responses from the respective servers, and transfers one of the valid responses to the client 40. The information processors 91, 92 are provided with server monitors 71, 72, which monitor the operation of the servers, and system monitors 81, 82, which monitor the operation of the information processors themselves. The control part 30 isolates a faulty server from the system based on the outputs of these monitors.

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a server system for use in systems that must operate continuously for long periods, such as Internet business systems (mail servers, Web servers, EC servers, and the like) and node systems of communication networks.

[0002]

[Prior Art] A server system of this kind must detect failures of the server itself and of the information processing device on which the server is mounted, and must prevent the server system as a whole from failing. Here, a "failure" means a loss of system function. Concrete causes of a failure include, for example, bugs and breakdowns in hardware and bugs in software. These causes of a failure are called "faults". Even if a fault exists in a system, it does not necessarily cause a failure immediately; it may remain latent. When an anomaly surfaces because of a fault, this is called an "error". A failure occurs when a system in which an error has occurred deviates from its normal state. Note that errors are not caused only by faults in the narrow sense, such as bugs and breakdowns; intermittent faults and operator configuration or operation mistakes, for example, can also cause errors.

[0003] Conventionally, as one example of a high-reliability system with excellent fault tolerance, the system described in "High-Reliability Technology of the Fault Tolerance Server FT6100", IEICE, FTS93-56, 1996, has been proposed. An example of this high-reliability system is described with reference to FIG. 18.

[0004] As shown in FIG. 18, this high-reliability system comprises three MPUs (Micro Processing Units) 1010, 1020, and 1030, each with the same specifications, an MPU monitoring device 1110 connected to each of the MPUs 1010, 1020, and 1030, and a clock supply circuit 1210.

[0005] The MPU 1010 is connected to a memory 1011 and a hard disk 1012 via a system bus 1013, and these devices together form one system. Similarly, the MPU 1020 is connected to a memory 1021 and a hard disk 1022 via a system bus 1023, and these devices together form another system.

[0006] A buffer 1310 is provided on the MPU 1010 side of the system bus 1013. When the MPU 1010 becomes faulty, this buffer 1310 is put into a high-impedance state (that is, a logically disconnected state), which disconnects the system bus 1013 from the MPU 1010. Similarly, a buffer 1320 is provided on the MPU 1030 side of the system bus 1023. When the MPU 1020 becomes faulty, this buffer 1320 is put into a high-impedance state (that is, a logically disconnected state), which disconnects the system bus 1023 from the MPU 1020.

[0007] The system bus 1013 of one system and the system bus 1023 of the other system are connected to each other via a buffer 1330. When a fault occurs in one of the two systems, this buffer 1330 allows normal signals to be taken in from the other system.

[0008] The hard disk 1012, which belongs to one system, is also connected to the system bus 1023 of the other system and can be accessed from the MPU 1020. Similarly, the hard disk 1022, which belongs to the other system, is connected to the system bus 1013 of the first system and can be accessed from the MPU 1010.

[0009] The clock supply circuit 1210 is connected to each element of the system, namely the MPUs 1010, 1020, and 1030, the MPU monitoring device 1110, the memories 1011 and 1021, the hard disks 1012 and 1022, the system buses 1013 and 1023, the buffers 1310, 1320, and 1330, and so on, and supplies a common signal so that these elements operate in synchronization. In FIG. 18, the wiring that distributes the clock signal from the clock supply circuit 1210 to the other elements is omitted to avoid complicating the drawing.

[0010] The operation of this high-reliability system is described next. The MPUs 1010, 1020, and 1030 operate on the clock signal from the clock supply circuit 1210, synchronized with one another and performing identical processing.

[0011] The MPU monitoring device 1110 compares the outputs of the MPUs 1010, 1020, and 1030 every machine cycle and performs what is called "majority decision processing". Majority decision processing compares several signals with one another and treats the signal output by the largest number of sources as the normal signal. Here, if the outputs of the MPUs 1010, 1020, and 1030 are all the same, the MPU monitoring device 1110 determines that no MPU is faulty and processing simply continues. If the output of the MPU 1010 differs from that of the other two MPUs 1020 and 1030, the MPU monitoring device 1110 determines that the MPU 1010 has failed, sets the buffer 1310 to high impedance, and opens the buffer 1330. If the output of the MPU 1020 differs from that of the other two MPUs 1010 and 1030, the MPU monitoring device 1110 determines that the MPU 1020 has failed, sets the buffer 1320 to high impedance, and opens the buffer 1330. Finally, if the output of the MPU 1030 differs from that of the other two MPUs 1010 and 1020, the MPU monitoring device 1110 determines that the MPU 1030 has failed; however, because the MPU 1030 operates only for the majority decision, the buffers 1310, 1320, and 1330 are not controlled.
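
For illustration only, the three-way comparison described above can be sketched in software as follows. The actual prior-art system performs this comparison in hardware every machine cycle; the function name `tmr_vote` and the `BUFFER_ACTIONS` table are assumptions introduced for this sketch, and the case where no two outputs agree is not covered by the description, so the sketch simply raises an error there.

```python
def tmr_vote(out_1010, out_1020, out_1030):
    """Compare three MPU outputs; return (agreed_output, faulty_mpu or None)."""
    if out_1010 == out_1020 == out_1030:
        return out_1010, None          # all agree: no fault
    if out_1020 == out_1030:
        return out_1020, 1010          # MPU 1010 is the odd one out
    if out_1010 == out_1030:
        return out_1010, 1020          # MPU 1020 is the odd one out
    if out_1010 == out_1020:
        return out_1010, 1030          # MPU 1030 is the odd one out
    # Assumption: the text does not say what happens when all three differ.
    raise ValueError("no two outputs agree")


# Buffer control that follows each verdict, as described in the text.
BUFFER_ACTIONS = {
    1010: ("set buffer 1310 to high impedance", "open buffer 1330"),
    1020: ("set buffer 1320 to high impedance", "open buffer 1330"),
    1030: (),  # voter-only MPU: no buffer control
}
```

A caller would look up `BUFFER_ACTIONS[faulty]` whenever the second return value is not `None`.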

[0012]

[Problems to Be Solved by the Invention] As described above, conventional high-reliability systems have adopted a scheme in which several MPUs are installed and operated synchronously from the same clock supply circuit, and a monitoring device that watches each MPU's output detects failures. They have also included a mechanism that, when a failure is detected, sets the system bus to high impedance so that operation can continue on the healthy system alone. In other words, a conventional high-reliability system requires a hardware configuration entirely different from that of a general-purpose computer, so a dedicated system had to be built for each use and purpose.

[0013] As a result, conventional high-reliability systems have the following problems:
(1) Cost is high.
(2) Developing a high-performance high-reliability system that uses the new products appearing on the market one after another (high-performance processors, large-capacity memory, and the like) takes a long time.
(3) Because the clock signal must be distributed throughout the high-reliability system, the wiring delay caused by longer clock wiring becomes a bottleneck and makes it difficult to raise the system's performance. In addition, depending on where the clock supply circuit is mounted, clock phase shifts can arise between the systems, so a phase-alignment circuit is required. Circuit design therefore becomes difficult and the configuration complex.
(4) Because failures are detected from hardware-level information only, software failures are hard to find.

[0014] The present invention has been made in view of the above circumstances, and its object is to provide a highly fault-tolerant server system with a simple and inexpensive configuration, and a failure detection method for it.

[0015]

[Means for Solving the Problems] To achieve the above object, the present invention provides a server-side system in a client-server system, comprising one or more servers mounted on one or more information processing devices and server monitoring means for monitoring the operation of the servers, wherein intermediary means is interposed between the client and the servers, the intermediary means having: processing request transfer means for transferring a processing request from the client to the servers; response result holding means for holding the response results received from the servers in response to the processing request; validity determination means for determining the validity of the held response results; and response result transfer means for transferring only a response result determined to be valid to the requesting client.

[0016] According to the present invention, the intermediary means transfers a processing request from the client to the servers. The validity of the response results from the servers is then determined, and only a valid response result is transferred to the client. Errors in a server can thus be concealed from the client. Moreover, because the server monitoring means can monitor the operation of the servers, even if a server fails, the system as a whole remains highly fault-tolerant through measures such as disconnecting that server from the system. Because errors can be concealed in this way without requiring any special configuration in the servers, a highly fault-tolerant system can be built from inexpensive general-purpose computers. Server performance can also be raised through higher-performance software and hardware, which makes it easy to raise the performance of the whole system. Furthermore, because the clock wiring delay and phase alignment problems described above are also resolved, a high-performance system can be built.

[0017] Here, a "server" means a program that causes an information processing device to function as means for providing a predetermined service to a client.

[0018] As one example of a preferred aspect of the present invention, it is proposed that the server monitoring means comprise first monitoring means that monitors whether the server is operating normally on the information processing device and, when the server operates abnormally, notifies the intermediary means of a server failure.

[0019] As another example of a preferred aspect of the present invention, it is proposed that the server monitoring means comprise second monitoring means that is mounted on the information processing device and sends a liveness confirmation message to the intermediary means at a predetermined interval.

[0020]

[Embodiments of the Invention] A high-reliability server system according to one embodiment of the present invention is described with reference to the drawings. FIG. 1 is a block diagram showing the overall configuration of the high-reliability server system. This embodiment illustrates a system that provides a Web service to clients. In this embodiment, Linux is used as the OS making up the system, and the Apache Software Foundation's HTTP server is used as the server program that provides the Web service.

[0021] As shown in FIG. 1, the server system 10 comprises information processing devices 91 and 92, which make up a high-reliability server 20, and an information processing device 93, which relays requests and responses between the high-reliability server 20 and a client 40.

[0022] The high-reliability server 20 consists of one or more servers. In the example of FIG. 1, it consists of three servers 21, 22, and 23. Each of the servers 21, 22, and 23 returns the same response to the same request.

[0023] The high-reliability server 20 also consists of one or more information processing devices. In the example of FIG. 1, it consists of two information processing devices 91 and 92. The servers 21 and 22 are mounted on the information processing device 91, and the server 23 is mounted on the information processing device 92.

[0024] Each information processing device making up the high-reliability server 20 hosts, in addition to the servers, a server monitor that checks that the servers are alive and a system monitor that periodically sends an alive message so that the liveness of the information processing device hosting the servers can be confirmed. In the example of FIG. 1, a server monitor 71 and a system monitor 81 are mounted on the information processing device 91, and a server monitor 72 and a system monitor 82 are mounted on the information processing device 92.

[0025] A control unit 30 and a message receiving unit 31 are located on the information processing device 93. The control unit 30 has a server management table 60, in which it registers the state of each server (for example, "normal" or "failed"). FIG. 2 shows an example of the server management table. As shown in FIG. 2, the server management table 60 consists of a server ID identifying the server, the server's IP address, its port number, the identifier of the information processing device on which it runs, the server's operating state, and so on. By referring to the server management table 60, the control unit 30 decides which servers should process a request from a client. When a request arrives from a client, TCP connections are established between the control unit 30 and the servers 21, 22, and 23 by the procedure described later.
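
As a rough sketch of how such a server management table might be represented, the following uses the fields listed above. The class and method names, the IP addresses (taken from the 192.0.2.0/24 documentation range), and the port numbers are assumptions for illustration; FIG. 2 is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class ServerEntry:
    server_id: int          # e.g. 21, 22, 23
    ip: str                 # server IP address (hypothetical values below)
    port: int               # server port number
    host_id: int            # information processing device, e.g. 91 or 92
    state: str = "normal"   # "normal" or "failed"


class ServerTable:
    """Hypothetical in-memory stand-in for server management table 60."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.server_id] = entry

    def mark_failed(self, server_id):
        self._entries[server_id].state = "failed"

    def normal_servers(self):
        # The servers a client request should be forwarded to.
        return [e for e in self._entries.values() if e.state == "normal"]


table = ServerTable()
table.register(ServerEntry(21, "192.0.2.91", 8081, host_id=91))
table.register(ServerEntry(22, "192.0.2.91", 8082, host_id=91))
table.register(ServerEntry(23, "192.0.2.92", 8081, host_id=92))
```

Looking up `table.normal_servers()` before forwarding corresponds to the control unit consulting the table to choose the servers that process a request.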

[0026] The message receiving unit 31 periodically receives alive messages from the system monitors 81 and 82. It also receives the failure notification messages that the server monitor 71 or 72 sends when a server fails. TCP connections are established in advance between the message receiving unit 31 and the server monitors 71 and 72, and between the message receiving unit 31 and the system monitors 81 and 82.

[0027] The client 40 uses a suitable browser to request Web pages and the like from the high-reliability server 20 via a network 50.

[0028] Next, the operation of the control unit 30 is described with reference to FIG. 3. FIG. 3 is the processing flow of the control unit 30.

[0029] The control unit 30 checks whether there is a failure notification from the message receiving unit 31 (step SA1). If there is, it further determines whether the message came from the server monitor 71 or 72 (step SA2). If it did, the control unit 30 registers the server concerned as failed in the server management table 60 and disconnects that server (step SA3). If it did not, the alive messages from the system monitor 81 or 82 have stopped, so the control unit 30 registers the servers on the information processing device concerned as failed in the server management table 60 and disconnects them (step SA5).
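
The branch between steps SA3 and SA5 can be sketched as below. The notification format (a dict with `source`, `server_id`, and `host` keys) and the plain-dict table are assumptions made for this example; the text does not specify a message layout.

```python
def handle_failure_notification(table, notification):
    """Mark the affected servers as failed and return their IDs.

    table: dict mapping server_id -> {"host": host_id, "state": str}
    notification: hypothetical dict describing the failure report.
    """
    if notification["source"] == "server_monitor":
        # Step SA3: a server monitor reported one specific server.
        failed = [notification["server_id"]]
    else:
        # Step SA5: alive messages from a system monitor stopped, so every
        # server on that information processing device is treated as failed.
        failed = [sid for sid, entry in table.items()
                  if entry["host"] == notification["host"]]
    for sid in failed:
        table[sid]["state"] = "failed"
    return failed


table = {
    21: {"host": 91, "state": "normal"},
    22: {"host": 91, "state": "normal"},
    23: {"host": 92, "state": "normal"},
}
```

With this table, a lost heartbeat from device 91 takes servers 21 and 22 out of service together, while a server monitor report affects only the server it names.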

[0030] When the control unit 30 determines that there is no failure notification from the message receiving unit 31 (step SA1), it checks the well-known port (port 80, since this is a Web service) to see whether there is a request from a client (step SA6).

[0031] If there is a request from a client (step SA6), the control unit 30 establishes a TCP connection and creates a child control process (step SA7). Once the child control process has been created, the control unit 30 checks for failure notifications again (step SA1).

[0032] As an example, FIG. 4 shows the state in which the client 40 has issued a request to the server system 10, the control unit 30 has received the request, and a child control process 32 has been created. To keep the drawing simple, FIG. 4 omits the server monitors 71 and 72, the system monitors 81 and 82, the information processing devices 91 and 92, and the message receiving unit 31.

[0033] Next, the operation of the child control process 32 created by the control unit 30 is described with reference to FIGS. 5 to 9. FIGS. 5 and 6 are the processing flow of the child control process. FIGS. 7 to 9 illustrate the operation of the high-reliability server system. As in FIG. 4, the server monitors and related elements are omitted from FIGS. 7 to 9 to keep the drawings simple.

[0034] The child control process 32 refers to the server management table 60 and establishes TCP connections with the normal servers 21, 22, and 23 (step SB1). FIG. 7 shows the state at this point.

[0035] Next, the child control process 32 checks whether there is a processing request from the client 40 (step SB2). If there is, it stores the request in a FIFO queue (step SB3); if there is not, it does nothing and moves on to the next step.

[0036] If the child control process 32 is waiting for responses from the servers (step SB4) and all the responses have arrived (step SB7), it takes a majority vote over the responses (step SB12). If all the responses are identical (step SB13), it judges that no server has failed and returns one response to the client 40 (step SB15). If the vote shows that the responses are not all identical (step SB13), it changes the operating state of the server that returned the minority response from "normal" to "failed" in the server management table 60 and disconnects that server from the system (step SB14). It then returns one correct response to the client 40 (step SB15) (see FIG. 9).
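
A minimal sketch of the majority decision in steps SB12 to SB15 follows. The function name `decide` and the use of `None` for a missing response are assumptions; the text also does not say what happens when no strict majority exists (for example, two servers that disagree), so this sketch raises an error in that case rather than guessing.

```python
from collections import Counter


def decide(responses):
    """Majority vote over server responses.

    responses: dict mapping server_id -> response bytes, or None when the
    server gave no response. Returns (answer, minority_ids), where
    minority_ids lists the servers to mark "failed" and disconnect.
    """
    answered = [body for body in responses.values() if body is not None]
    if not answered:
        raise LookupError("no server answered")
    answer, votes = Counter(answered).most_common(1)[0]
    # Assumption: require a strict majority of all connected servers.
    if votes * 2 <= len(responses):
        raise LookupError("no majority among responses")
    minority = sorted(sid for sid, body in responses.items() if body != answer)
    return answer, minority
```

For example, `decide({21: b"A", 22: b"B", 23: b"A"})` yields the answer `b"A"` and reports server 22 as the minority, which corresponds to step SB14 followed by step SB15.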

[0037] After that, the child control process 32 again checks whether there is a new request from the client 40 (step SB2).

[0038] If the child control process 32 is not waiting for responses from the servers (step SB4) and the FIFO queue is not empty, that is, a request is pending (step SB5), it takes one request from the head of the FIFO queue, makes copies of it, and transfers them to all the connected servers 21, 22, and 23 (step SB6) (see FIG. 8).

[0039] If the child control process 32 is not waiting for responses from the servers (step SB4) and the FIFO queue is empty (step SB5), the service to the client 40 is complete and the child control process 32 terminates.
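
Steps SB5 and SB6 can be sketched as follows. Here `connections` maps each server ID to a callable that sends bytes to that server; this callable interface and the function name `dispatch_next` are assumptions that stand in for the TCP connections opened in step SB1.

```python
from collections import deque


def dispatch_next(fifo, connections):
    """Forward the oldest queued request to every connected server.

    Returns the forwarded request, or None when the queue is empty, which
    in the flow described above means the child control process can exit.
    """
    if not fifo:                      # step SB5: queue empty
        return None
    request = fifo.popleft()          # take one request from the head
    for send in connections.values():
        send(bytes(request))          # step SB6: one copy per server
    return request
```

Requests stay in arrival order because new ones are appended at the tail (step SB3) and only the head is ever forwarded.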

[0040] If not all the responses have arrived (step SB7) and a response has been received (step SB8), the child control process 32 stores that response in a buffer and checks for a timeout (step SB10). If no response has been received (step SB8), it skips the storing step and checks for a timeout (step SB10).

[0041] If a timeout has occurred at step SB10, the response from each server that timed out is treated as a NULL message (step SB11), and the majority vote over the responses is taken (step SB12). If no timeout has occurred (step SB10), the child control process 32 checks whether a new request has arrived from the client 40 (step SB2).
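
The buffering and timeout handling of steps SB8 to SB11 might look like the sketch below. It simplifies the flow by blocking until the deadline instead of interleaving checks for new client requests, and it assumes that responses arrive on a `queue.Queue` as `(server_id, body)` pairs; both the queue and the function name are illustrative choices, with `None` standing in for the NULL message.

```python
import queue
import time


def collect_responses(inbox, server_ids, timeout_s):
    """Gather one response per server until all arrive or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    responses = {}
    while len(responses) < len(server_ids):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break                                   # step SB10: timed out
        try:
            server_id, body = inbox.get(timeout=remaining)
        except queue.Empty:
            break                                   # nothing more arrived in time
        if server_id in server_ids:
            responses[server_id] = body             # store the response in the buffer
    for sid in server_ids:
        responses.setdefault(sid, None)             # step SB11: NULL for silent servers
    return responses
```

The returned dict can then be handed to the majority vote, where a `None` entry counts as a dissenting response.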

[0042] Next, the operation of the server monitor is described with reference to FIG. 10. FIG. 10 is the processing flow of the server monitor.

[0043] The server monitor obtains the process numbers of the servers to be monitored and creates list 1 (step SC1). For example, when the server is Apache, Apache records its process number in the file /usr/local/apache/logs/httpd.pid, so the server monitor obtains the process number to monitor from that file.

[0044] The server monitor then sleeps for a fixed period (step SC2), after which it obtains the process numbers of the servers currently running and creates list 2 (step SC3). On Linux, for example, the numbers of the currently running processes appear under the /proc directory, so list 2 is created from there.

[0045] Next, the server monitor checks whether any server is present in list 1 but absent from list 2 (step SC4). If there is such a server, it has failed and died for some reason (for example, a segmentation fault or a bus error), so the server monitor notifies the message receiving unit 31 to that effect (step SC5) and deletes the server from list 1. It then sleeps again (step SC2) and checks the currently running servers (step SC3).

[0046] If no server is present in list 1 but absent from list 2 (step SC4), no server has failed (all are normal), so the server monitor sleeps again (step SC2) and then checks the currently running servers (step SC3).
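
One pass of the list 1 / list 2 comparison can be sketched as follows. The function names are assumptions, and the sketch treats both lists as sets of process numbers; sending the notification of step SC5 is left to the caller.

```python
import os


def running_pids(proc_root="/proc"):
    """List 2: process numbers currently present under /proc (Linux)."""
    return {int(name) for name in os.listdir(proc_root) if name.isdecimal()}


def check_once(list1, list2):
    """Step SC4: find monitored servers that are no longer running.

    Returns (dead, updated_list1). The caller would report each process
    number in `dead` to the message receiving unit (step SC5).
    """
    dead = list1 - list2
    return dead, list1 - dead
```

A monitoring loop would sleep, call `running_pids()`, call `check_once(...)`, send any notifications, and repeat with the updated list 1.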

[0047] Next, the operation of the system monitor is described with reference to FIG. 11. FIG. 11 is the processing flow of the system monitor.

[0048] The system monitor repeats the following steps: it sleeps for a fixed period (step SD1) and then sends an alive message to the message receiving unit 31 (step SD2). If the messages from a system monitor stop, the information processing device is known to have failed, so its servers are disconnected via the server management table 60 (step SA5 in FIG. 3).
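
On the receiving side, detecting that alive messages have stopped could be sketched as below. The text does not state how long the receiver waits before declaring a device failed, so `timeout_s` is an assumed parameter, and the class name `AliveTracker` is hypothetical. Timestamps are passed in explicitly to keep the logic independent of the clock source.

```python
class AliveTracker:
    """Track the last alive message seen from each information processing device."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}

    def on_alive(self, host_id, now):
        # Called whenever an alive message arrives from a system monitor.
        self.last_seen[host_id] = now

    def silent_hosts(self, now):
        # Devices whose alive messages have lapsed for longer than timeout_s.
        return sorted(host for host, seen in self.last_seen.items()
                      if now - seen > self.timeout_s)
```

Each device returned by `silent_hosts` would trigger the handling of step SA5 for all servers mounted on it.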

[0049] Next, the process of incorporating a server (at initialization or on recovery from a failure) is described with reference to FIG. 12. FIG. 12 is the processing flow of the server monitor and the control unit when a server is incorporated. Here it is assumed that the information processing device and the server monitor and system monitor running on it are operating normally.

[0050] On the information processing device where the server runs, when the Apache server starts, the process number of the started server is recorded in /usr/local/apache/logs/httpd.pid, so the server monitor recognizes that a new server has started by referring to this file (step SE1). It then transfers information about the newly started server, such as its IP address, port number, and the identifier of the information processing device, to the message receiving unit 31 (step SE2) and adds the server to list 1 (step SE3).

[0051] The message receiving unit 31, for its part, passes the received information to the control unit 30. The control unit 30 adds the new server's information to the server management table 60 and sets its operating state to "normal" (step SE4).

[0052] The above steps complete the incorporation process. The server monitor and system monitor are started either by typing commands on the command line after the information processing device has booted or automatically by a script when the device boots.

【００５３】次に、子制御プロセスが実現する要求と応
答の順序制御について図１３及び図１４を参照して説明
する。図１３は子制御プロセス３２がクライアント４０
とＴＣＰ接続を行い、さらに、サーバ２１，２２ともＴ
ＣＰ接続を行っている例を示す。Next, the ordering control of requests and responses implemented by the child control process will be described with reference to FIGS. 13 and 14. FIG. 13 shows an example in which the child control process 32 has a TCP connection with the client 40 and also has TCP connections with the servers 21 and 22.

【００５４】クライアント４０から処理要求Ｘ１を受け
取った子制御プロセス３２は、最初にサーバ２１へ要求
Ｘ２を転送し、続いてサーバ２２へ要求Ｘ３を転送す
る。次に、子制御プロセス３２は、最初にサーバ２２か
ら応答Ｘ４を受信し、続いてサーバ２１からの応答Ｘ５
を受信している。最後に、子制御プロセス３２は、クラ
イアント４０へ応答Ｘ６を返す。Upon receiving the processing request X1 from the client 40, the child control process 32 first transfers the request X2 to the server 21 and then transfers the request X3 to the server 22. Next, the child control process 32 first receives the response X4 from the server 22 and then receives the response X5 from the server 21. Finally, the child control process 32 returns the response X6 to the client 40.

【００５５】このように、子制御プロセスがサーバへ処
理要求を転送する順序と、子制御プロセスがサーバから
応答をもらう順序は任意である。例えば、子制御プロセ
ス３２は、最初にサーバ２２へ要求Ｘ３を転送し、続い
てサーバ２１へ要求Ｘ２を転送してもよい。また、子制
御プロセス３２は、最初にサーバ２１からの応答Ｘ５を
受信し、続いてサーバ２２からの応答Ｘ４を受信しても
よい。As described above, the order in which the child control process transfers the processing request to the server and the order in which the child control process receives the response from the server are arbitrary. For example, the child control process 32 may first transfer the request X3 to the server 22, and then transfer the request X2 to the server 21. Further, the child control process 32 may first receive the response X5 from the server 21 and subsequently receive the response X4 from the server 22.

【００５６】図１４は、子制御プロセス３２がクライア
ント４０とＴＣＰ接続を行い、さらにサーバ２１，２２
ともＴＣＰ接続を行っている別の例を示す。クライアン
ト４０から処理要求Ｙ１を受け取った子制御プロセス３
２は、最初にサーバ２１へ要求Ｙ２を転送し、続いてサ
ーバ２２へ要求Ｙ３を転送している。次に、子制御プロ
セス３２は、サーバ２２からの応答Ｙ４を受信したが、
サーバ２１からの応答Ｙ５を受信しないうちに、クライ
アント４０から新た要求Ｚ１を受信した。FIG. 14 shows another example in which the child control process 32 has a TCP connection with the client 40 and also has TCP connections with the servers 21 and 22. Having received the processing request Y1 from the client 40, the child control process 32 first transfers the request Y2 to the server 21 and then transfers the request Y3 to the server 22. The child control process 32 then receives the response Y4 from the server 22 but, before receiving the response Y5 from the server 21, receives a new request Z1 from the client 40.

【００５７】この場合、要求Ｚ１をサーバ２１，２２へ
転送せずに、要求Ｙ１に対する応答Ｙ６をクライアント
４０へ返すまで待たなければならない。これは、この状
況で新たな要求Ｚ１を各サーバに転送すると、各サーバ
２１，２２は異なる動作をする可能性があるためであ
る。In this case, the request Z1 must not be transferred to the servers 21 and 22 yet; the child control process must wait until the response Y6 to the request Y1 has been returned to the client 40. This is because, if the new request Z1 were transferred to the servers in this situation, the servers 21 and 22 might behave differently from each other.

【００５８】そして、子制御プロセス３２は、サーバ２
１からの応答Ｙ５を受信し、クライアント４０へ応答を
返すと、要求Ｚ２，Ｚ３をそれぞれサーバ２１，２２へ
転送する。次いで、子制御プロセス３２は、サーバ２１
から応答Ｚ４を受信し、続いてサーバ２２から応答Ｚ５
を受信する。最後に子制御プロセス３２は、クライアン
ト４０へ応答Ｚ６を返す。Then, once the child control process 32 has received the response Y5 from the server 21 and returned the response to the client 40, it transfers the requests Z2 and Z3 to the servers 21 and 22, respectively. Next, the child control process 32 receives the response Z4 from the server 21 and then receives the response Z5 from the server 22. Finally, the child control process 32 returns the response Z6 to the client 40.
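The ordering rule of FIG. 14 (hold back Z1 until the reply to Y1 has gone to the client) can be sketched as a small event-driven class. This is only one possible arrangement: the class name `OrderedRelay` and the `send` and `reply` callbacks are hypothetical, and the validity determination is deliberately left out.

```python
from collections import deque
from typing import Callable, Deque, List, Optional


class OrderedRelay:
    """Forwards a client's next request only after the current one is answered."""

    def __init__(
        self,
        server_count: int,
        send: Callable[[int, str], None],   # hypothetical: send(server_index, request)
        reply: Callable[[str], None],       # hypothetical: return a response to the client
    ) -> None:
        self.server_count = server_count
        self.send = send
        self.reply = reply
        self.pending: Deque[str] = deque()
        self.in_flight: Optional[str] = None
        self.responses: List[str] = []

    def on_client_request(self, request: str) -> None:
        self.pending.append(request)
        self._dispatch()

    def on_server_response(self, response: str) -> None:
        if self.in_flight is None:
            return  # stray response; nothing is outstanding
        self.responses.append(response)
        if len(self.responses) < self.server_count:
            return  # still waiting for the remaining servers
        self.reply(self.responses[0])  # validity determination omitted in this sketch
        self.responses = []
        self.in_flight = None
        self._dispatch()  # only now is a queued request (e.g. Z1) forwarded

    def _dispatch(self) -> None:
        if self.in_flight is None and self.pending:
            self.in_flight = self.pending.popleft()
            for i in range(self.server_count):
                self.send(i, self.in_flight)
```

Responses may arrive in any order from the servers; the relay only counts them, which matches the statement that the order of transfer and reception is arbitrary.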

【００５９】以上のように、本実施の形態に係る高信頼
性サーバシステムによれば、制御部３０により、クライ
アント４０からの処理要求が各サーバ２１，２２，２３
に転送される。そして、サーバ２１，２２，２３からの
応答結果に対して正当性が判定され、正当な応答結果の
みがクライアント４０に転送される。これにより、クラ
イアント４０に対してサーバにおける「誤り」を隠蔽で
きる。また、サーバモニタ７１，７２及びシステムモニ
タ８１，８２によりサーバの動作が監視できるので、サ
ーバ自体の障害又はサーバが動作している情報処理装置
の障害が発生しても該サーバをシステムから切り離すな
どの処理を行うことにより、システム全体として耐障害
性に優れたものとなる。As described above, in the high-reliability server system according to the present embodiment, the control unit 30 transfers the processing request from the client 40 to each of the servers 21, 22, and 23. The validity of the response results from the servers 21, 22, and 23 is then determined, and only a valid response result is transferred to the client 40. This makes it possible to hide an “error” in a server from the client 40. Further, since the server monitors 71 and 72 and the system monitors 81 and 82 monitor the operation of the servers, even if a failure occurs in a server itself or in the information processing device on which the server runs, processing such as disconnecting that server from the system can be performed, so that the system as a whole has excellent fault tolerance.
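The overall flow summarized above (forward to every server, judge the responses, return one valid result, and cut off a faulty server) might look like the following sketch. The `mediate` function, the callables standing in for servers, and the requirement of a strict majority are all assumptions for illustration; the embodiment itself only speaks of a majority vote among three servers.

```python
from collections import Counter
from typing import Callable, Dict, Optional


def mediate(
    request: str,
    servers: Dict[str, Callable[[str], str]],  # hypothetical: name -> function returning a response
    states: Dict[str, str],                    # stand-in for the server management table
) -> Optional[str]:
    """Forward request to all normal servers and return the majority response."""
    responses: Dict[str, str] = {}
    for name, call in servers.items():
        if states.get(name) != "normal":
            continue  # already disconnected from the system
        try:
            responses[name] = call(request)
        except Exception:
            states[name] = "failed"  # no usable response from this server
    if not responses:
        return None
    winner, votes = Counter(responses.values()).most_common(1)[0]
    if votes * 2 <= len(responses):
        return None  # no strict majority (assumption of this sketch)
    for name, resp in responses.items():
        if resp != winner:
            states[name] = "failed"  # disconnect the dissenting server
    return winner
```

Because the client only ever sees the return value, a wrong answer from one server stays hidden as long as the other servers agree.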

【００６０】なお、異なるクライアントからの要求を別
の子制御プロセスを生成することにより、処理の並列化
を図り、これにより性能の向上を図ることができる。例
えば、図１５の例では、クライアント４０からの要求は
子制御プロセス３２が処理をし、クライアント４１から
の要求は子制御プロセス３３が処理する。また、この場
合には、異なるクライアント同士の要求・応答の順序制
御は行わない。Note that processing can be parallelized, and performance thereby improved, by creating a separate child control process for the requests from each different client. For example, in FIG. 15, requests from the client 40 are processed by the child control process 32, and requests from the client 41 are processed by the child control process 33. In this case, no ordering control is performed between the requests and responses of different clients.
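The per-client parallelism can be sketched as follows. Threads are used here as a stand-in for the child control processes of the embodiment, and the `ClientDispatcher` class and its `handle` callback are hypothetical. Requests of one client stay in order, while nothing orders requests of different clients relative to each other.

```python
import queue
import threading
from typing import Callable, Dict, List


class ClientDispatcher:
    """Runs one worker per client; each worker handles its client's requests in order."""

    def __init__(self, handle: Callable[[str, str], None]) -> None:
        self.handle = handle  # hypothetical: handle(client_id, request)
        self.queues: Dict[str, queue.Queue] = {}
        self.threads: List[threading.Thread] = []

    def submit(self, client_id: str, request: str) -> None:
        q = self.queues.get(client_id)
        if q is None:
            # First request from this client: create its dedicated worker.
            q = queue.Queue()
            self.queues[client_id] = q
            t = threading.Thread(target=self._worker, args=(client_id, q), daemon=True)
            self.threads.append(t)
            t.start()
        q.put(request)

    def _worker(self, client_id: str, q: queue.Queue) -> None:
        while True:
            request = q.get()
            if request is None:  # shutdown marker
                break
            self.handle(client_id, request)

    def close(self) -> None:
        """Let every worker finish its queued requests, then stop it."""
        for q in self.queues.values():
            q.put(None)
        for t in self.threads:
            t.join()
```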

【００６１】このように、サーバに特別な構成を必要と
することなく誤りの隠蔽ができるので、安価な汎用コン
ピュータを用いた障害耐性に優れたシステムの構築が可
能となる。また、ソフトウェアやハードウェアの高性能
化によるサーバの高性能化を図り、これによりシステム
全体の高性能化を容易に行うことができる。さらに、前
述したクロック配線遅延問題や位相合わせ問題も解決で
きるため高性能なシステムを構築できる。さらに、ハー
ドウェアの障害だけでなくソフトウェア（サーバ）の障
害をも検出することができるので、システム全体の耐障
害性が優れたものとなる。In this way, since errors can be concealed without requiring a special configuration in the server, it is possible to construct a system with excellent fault tolerance using an inexpensive general-purpose computer. In addition, the performance of the server can be improved by improving the performance of software and hardware, and thus the performance of the entire system can be easily improved. Further, since the clock wiring delay problem and the phase matching problem described above can be solved, a high performance system can be constructed. Further, not only hardware failure but also software (server) failure can be detected, so that the fault tolerance of the entire system becomes excellent.

【００６２】以上本発明の実施形態について説明したが
本発明はこれに限定されるものではない。本発明の範囲
は特許請求の範囲によって示されており、全ての変形例
は本発明に含まれるものである。Although the embodiment of the present invention has been described above, the present invention is not limited to this. The scope of the invention is indicated by the claims and all the modifications are included in the invention.

【００６３】例えば、各サーバは同じ応答を返すならば
同じ実装である必要はない。すなわち、バージョン・仕
様・プログラム言語・コンパイラ・コンパイラオプショ
ンなどが異なっていてもよい。また、サーバは一つのプ
ロセスから構成されていも複数のプロセスから構成され
ていてもよい。さらに、プロセスではなくオブジェクト
やスレッドでもよい。For example, each server need not have the same implementation if it returns the same response. That is, the versions, specifications, programming languages, compilers, compiler options, etc. may be different. Further, the server may be composed of one process or a plurality of processes. Furthermore, it may be an object or thread instead of a process.

【００６４】さらに、子制御プロセスと各サーバのＴＣ
Ｐ接続のオーバヘッドを軽減するために、あらかじめ複
数の子制御プロセスを作っておき、それらをサーバとあ
らかじめＴＣＰ接続をしておき、クライアントからの要
求を待ち受けてもよい。Furthermore, in order to reduce the overhead of establishing TCP connections between a child control process and the servers, a plurality of child control processes may be created in advance, each with its TCP connections to the servers already established, and made to wait for requests from clients.
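The idea of preparing workers and their server connections ahead of time can be sketched as a simple pool. The class `PreconnectedPool` and the `connect_all` factory are hypothetical; in a real system the factory would open TCP connections to every server, whereas here it is left abstract so that the sketch stays self-contained.

```python
from collections import deque
from typing import Callable, Deque, Generic, List, TypeVar

C = TypeVar("C")


class PreconnectedPool(Generic[C]):
    """Keeps sets of server connections that were opened before any client arrived."""

    def __init__(self, size: int, connect_all: Callable[[], List[C]]) -> None:
        # connect_all is called once per worker, up front, so the connection
        # cost is paid before the first client request is received.
        self.idle: Deque[List[C]] = deque(connect_all() for _ in range(size))

    def acquire(self) -> List[C]:
        """Hand out one ready set of connections; raises IndexError if none is idle."""
        return self.idle.popleft()

    def release(self, conns: List[C]) -> None:
        """Return a set of connections for reuse by the next client."""
        self.idle.append(conns)
```

The trade-off is that idle connections occupy resources on the servers, so the pool size would be chosen with the expected number of concurrent clients in mind.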

【００６５】さらに、上記実施の形態では、三つのサー
バ間で多数決を行うことにより応答の正当性を判断して
いるが、応答結果のフォーマットの正常性により判断し
てもよい。例えば、応答結果がｈｔｔｐのプロトコルフ
ォーマットに合致しているかどうかで正当性を判断す
る。また、要求と応答のマッチングから応答の正当性を
判断してもよい。例えば、ｈｔｔｐにおけるＧＥＴ要求
に対する応答のヘッダにＣｏｎｔｅｘｔ−Ｌｅｎｇｔｈ
が含まれており、ボディにデータが含まれている応答が
正当性を有する応答であると判断する。さらに、これら
の判断方法と多数決を組み合わせてもよい。Further, in the above embodiment, the validity of a response is determined by majority vote among the three servers, but it may instead be determined from whether the format of the response result is normal. For example, validity is determined by whether the response result conforms to the HTTP protocol format. The validity of a response may also be determined by matching the request against the response. For example, a response to an HTTP GET request is determined to be valid if its header contains Content-Length and its body contains data. Furthermore, these determination methods may be combined with majority vote.
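The three determination methods described in this paragraph can be combined roughly as follows. This is a simplified sketch, assuming that responses are available as complete strings and that the header meant in the text is the standard HTTP Content-Length header; the function names are hypothetical, and the format check covers only the status line rather than the full HTTP grammar.

```python
import re
from collections import Counter
from typing import List, Optional

# Simplified status-line pattern, e.g. "HTTP/1.1 200 OK".
STATUS_LINE = re.compile(r"^HTTP/\d\.\d \d{3}( .*)?$")


def well_formed(response: str) -> bool:
    """Format check: a status line followed by a header/body separator."""
    head, sep, _body = response.partition("\r\n\r\n")
    if not sep:
        return False
    return STATUS_LINE.match(head.split("\r\n", 1)[0]) is not None


def matches_get(response: str) -> bool:
    """Request/response matching for GET: Content-Length header and a non-empty body."""
    head, _sep, body = response.partition("\r\n\r\n")
    names = [
        line.split(":", 1)[0].strip().lower()
        for line in head.split("\r\n")[1:]
        if ":" in line
    ]
    return "content-length" in names and body != ""


def choose_response(responses: List[str]) -> Optional[str]:
    """Filter by both checks, then take the most common remaining response."""
    candidates = [r for r in responses if well_formed(r) and matches_get(r)]
    if not candidates:
        return None
    return Counter(candidates).most_common(1)[0][0]
```

Filtering before voting means that a malformed response never counts as a vote, which is one way of combining the format-based methods with the majority decision.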

【００６６】さらに、図１６に示すように、二つの制御
部３０，３０ａを設けることにより仲介装置の冗長化を
図り、これにより仲介装置の高信頼化を図ってもよい。Further, as shown in FIG. 16, by providing two control units 30 and 30a, the intermediary device may be made redundant, and thereby the intermediary device may be made highly reliable.

【００６７】さらに、図１７に示すように、処理の要求
元のクライアントに応じて、処理要求に対する処理を行
うサーバを分散させるようにしてもよい。図１７では、
クライアント４０についてはサーバ２１，２２，２３が
処理を行い、クライアント４１についてはサーバ２４，
２５，２６が処理を行っている。このような構成により
負荷分散が図れるので、サーバシステム全体の処理能力
を向上できる。Furthermore, as shown in FIG. 17, the servers that handle processing requests may be distributed according to the client that issued the request. In FIG. 17, the servers 21, 22, and 23 perform processing for the client 40, and the servers 24, 25, and 26 perform processing for the client 41. Since such a configuration distributes the load, the processing capacity of the entire server system can be improved.
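The assignment of FIG. 17 amounts to a lookup from client to server group. The sketch below assumes a static table; the identifiers such as "client40" and the `servers_for` helper are illustrative names only, and the specification does not say how the assignment is stored.

```python
from typing import Dict, List

# Static assignment mirroring FIG. 17 (identifiers are illustrative).
SERVER_GROUPS: Dict[str, List[str]] = {
    "client40": ["server21", "server22", "server23"],
    "client41": ["server24", "server25", "server26"],
}


def servers_for(client_id: str, groups: Dict[str, List[str]] = SERVER_GROUPS) -> List[str]:
    """Return the server group that handles requests from client_id."""
    try:
        return groups[client_id]
    except KeyError:
        raise LookupError(f"no server group assigned to {client_id!r}") from None
```

Keeping the groups disjoint means that each client's requests are still replicated across three servers while the two clients do not compete for the same servers.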

【００６８】さらに、上記実施の形態ではＯＳとしてＵ
ＮＩＸ（登録商標）の一つであるＬｉｎｕｘを前提とし
て説明したが、他のＯＳであってもよい。Further, although the above embodiment has been described on the premise that the OS is Linux, one of the UNIX (registered trademark) family of operating systems, another OS may be used.

【００６９】[0069]

【発明の効果】以上詳述したように、本発明によれば、
仲介手段により、クライアントからの処理要求がサーバ
に転送される。そして、サーバからの応答結果に対して
正当性が判定され、正当な応答結果のみがクライアント
に転送される。これにより、クライアントに対してサー
バにおける「誤り」を隠蔽できる。また、サーバ監視手
段によりサーバの動作が監視できるので、サーバに障害
が発生しても該サーバをシステムから切り離すなどの処
理を行うことによりシステム全体として耐障害性に優れ
たものとなる。このように、サーバに特別な構成を必要
とすることなく誤りの隠蔽ができるので、安価な汎用コ
ンピュータを用いた障害耐性に優れたシステムの構築が
可能となる。また、ソフトウェアやハードウェアの高性
能化によるサーバの高性能化を図り、これによりシステ
ム全体の高性能化を容易に行うことができる。さらに、
前述したクロック配線遅延問題や位相合わせ問題も解決
できるため高性能なシステムを構築できる。As described in detail above, according to the present invention, the intermediary means transfers the processing request from the client to the server. The validity of the response result from the server is then determined, and only a valid response result is transferred to the client. This makes it possible to hide an “error” in the server from the client. Further, since the server monitoring means monitors the operation of the server, even if a failure occurs in the server, processing such as disconnecting the server from the system can be performed, so that the system as a whole has excellent fault tolerance. In this way, since errors can be concealed without requiring a special configuration in the server, a system with excellent fault tolerance can be constructed using inexpensive general-purpose computers. In addition, the performance of the server can be improved by improving the performance of software and hardware, and thus the performance of the entire system can easily be improved. Furthermore, since the clock wiring delay problem and the phase matching problem described above are also solved, a high-performance system can be constructed.

[Brief description of drawings]

【図１】高信頼性サーバシステムの全体構成を示すブロ
ック図FIG. 1 is a block diagram showing the overall configuration of a highly reliable server system.

【図２】サーバ管理表の一例を説明する図FIG. 2 is a diagram illustrating an example of a server management table.

【図３】制御部の動作を説明するフローチャートFIG. 3 is a flowchart illustrating the operation of the control unit.

【図４】高信頼性サーバシステムの動作を説明する図FIG. 4 is a diagram for explaining the operation of the high reliability server system.

【図５】子制御プロセスの動作を説明するフローチャー
トFIG. 5 is a flowchart illustrating the operation of a child control process.

【図６】子制御プロセスの動作を説明するフローチャー
トFIG. 6 is a flowchart illustrating the operation of a child control process.

【図７】高信頼性サーバシステムの動作を説明する図FIG. 7 is a diagram for explaining the operation of the high reliability server system.

【図８】高信頼性サーバシステムの動作を説明する図FIG. 8 is a diagram for explaining the operation of the high reliability server system.

【図９】高信頼性サーバシステムの動作を説明する図FIG. 9 is a diagram for explaining the operation of the high reliability server system.

【図１０】サーバモニタの動作を説明するフローチャー
トFIG. 10 is a flowchart illustrating the operation of the server monitor.

【図１１】システムモニタの動作を説明するフローチャ
ートFIG. 11 is a flowchart illustrating the operation of the system monitor.

【図１２】サーバ組み込み時のサーバモニタと制御部の
動作を説明するフローチャートFIG. 12 is a flowchart illustrating operations of a server monitor and a control unit when the server is installed.

【図１３】子制御プロセスにおける処理の順序制御を説
明する図FIG. 13 is a diagram for explaining order control of processing in a child control process.

【図１４】子制御プロセスにおける処理の順序制御を説
明する図FIG. 14 is a diagram for explaining order control of processing in a child control process.

【図１５】他の例に係る高信頼性サーバシステムの動作
を説明する図FIG. 15 is a diagram for explaining the operation of a highly reliable server system according to another example.

【図１６】他の例に係る高信頼性サーバシステムの全体
構成を説明するブロック図FIG. 16 is a block diagram illustrating an overall configuration of a high reliability server system according to another example.

【図１７】他の例に係る高信頼性サーバシステムの全体
構成を説明するブロック図FIG. 17 is a block diagram illustrating the overall configuration of a highly reliable server system according to another example.

【図１８】従来の高信頼性システムの構成図FIG. 18 is a block diagram of a conventional high reliability system.

[Explanation of symbols]

１０…高信頼性サーバシステム、２０…高信頼性サー
バ、２１，２２，２３…サーバ、３０…制御部、３１…
メッセージ受信部、３２…子制御プロセス、４０…クラ
イアント、５０…ネットワーク、６０…サーバ管理表、
７１，７２…サーバモニタ、８１，８２…システムモニ
タ、９１，９２，９３…情報処理装置、10 ... High-reliability server system, 20 ... High-reliability server, 21, 22, 23 ... Server, 30 ... Control unit, 31 ...
Message receiving unit, 32 ... Child control process, 40 ... Client, 50 ... Network, 60 ... Server management table,
71, 72 ... Server monitor, 81, 82 ... System monitor, 91, 92, 93 ... Information processing device,

Continuation of front page
(72) Inventor: Etsuo Masuda, c/o Nippon Telegraph and Telephone Corporation, 2-3-1 Otemachi, Chiyoda-ku, Tokyo
(72) Inventor: Akito Fujita, c/o NTT Advanced Technology Corporation, 2-1-1 Nishi-Shinjuku, Shinjuku-ku, Tokyo
(72) Inventor: Yoichi Hijikata, c/o NTT Advanced Technology Corporation, 2-1-1 Nishi-Shinjuku, Shinjuku-ku, Tokyo
F-term (reference): 5B034 BB15 CC01 DD05 5B042 GA12 GC10 JJ05 JJ23 KK04 5B045 JJ06 JJ38 5B089 GA11 GA21 GB02 JB17 JB22 KA12 KC29 KC30 MC06 MC08 ME02 ME06

Claims

[Claims]

1. A server system on the server side of a client-server system, comprising: one or more servers mounted on one or more information processing devices; server monitoring means for monitoring the operation of the servers; and intermediary means interposed between a client and the servers, the intermediary means having processing request transfer means for transferring a processing request from the client to the servers, response result holding means for holding the response results received from the servers in response to the processing request, validity determination means for determining the validity of the held response results, and response result transfer means for transferring only a response result determined to be valid to the requesting client.

2. The server system according to claim 1, wherein the server monitoring means comprises first monitoring means that monitors whether the server is operating normally on the information processing device and notifies the intermediary means of a server failure when the server operates abnormally.

3. The server system according to claim 2, wherein the intermediary means includes means for disconnecting the server from the system upon receiving the server failure notification from the first monitoring means.

4. The server system according to any one of claims 1 to 3, wherein the server monitoring means comprises second monitoring means that is mounted on the information processing device and sends a survival confirmation message to the intermediary means at a predetermined cycle.

5. The server system according to claim 4, wherein the intermediary means includes means for disconnecting from the system the server on the information processing device on which the second monitoring means is mounted when no survival confirmation message has been received from the second monitoring means even after a predetermined time longer than the predetermined cycle has elapsed.

6. The server system according to claim 1, comprising a plurality of the servers, wherein the validity determination means compares the response results from the servers with one another and determines that the response result returned by the largest number of servers is valid.

7. The server system according to claim 1, wherein the validity determination means determines the validity of the response results from the servers that respond within a predetermined time.

8. The server system according to claim 1, wherein the validity determination means determines that a response result having a predetermined pattern is valid.

9. The server system according to claim 1, wherein the validity determination means determines the validity of a response result based on the correspondence between the processing request and the response result.

10. The server system according to claim 1, further comprising means for disconnecting from the system the server that output a response result not determined to be valid by the validity determination means.

11. A failure detection method in a client-server system comprising one or more servers mounted on one or more information processing devices, and intermediary means interposed between a client and the servers and having processing request transfer means for transferring a processing request from the client to the servers, response result holding means for holding the response results received from the servers in response to the processing request, validity determination means for determining the validity of the held response results, and response result transfer means for transferring only a response result determined to be valid to the requesting client, the method comprising detecting a failure by monitoring the operation of the servers.