JPH06274432A

JPH06274432A - System and method for managing distributed computer system

Info

Publication number: JPH06274432A
Application number: JP5059908A
Authority: JP
Inventors: Yasuhiro Izumida; 泰弘泉田; Hideji Tsutsumi; 秀二堤; Tatsugoro Nakatani; 辰五郎中谷; Hiroaki Okahara; 弘明岡原; Kazunari Nakamura; 一成中村
Original assignee: NKK Corp; Nippon Kokan Ltd
Current assignee: JFE Engineering Corp
Priority date: 1993-03-19
Filing date: 1993-03-19
Publication date: 1994-09-30

Abstract

PURPOSE: To make effective use of distributed system resources and to improve failure handling in a distributed computing environment. CONSTITUTION: A system management system based on the client/server model for a plurality of computers connected by a network comprises program execution management means having means for receiving a connection request from a client to a server, distributed system management control means that receives the connection request via the program execution management means, and a distributed system resource information database. The distributed system management control means has server selection means for selecting the computer having the optimal server using information in the database, and means for notifying the program execution management means of the selection result. The program execution management means has means for requesting the computer having the selected server to prepare for connection, including starting the server program.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to techniques for making effective use of distributed system resources and for handling failures in a distributed computing environment.

[0002]

[Prior Art] A known conventional computing resource information service (naming service) for distributed systems is NIS (Network Information Service) from Sun Microsystems. It is a system that centrally manages static management information such as user registration information, computer address information, and network configuration information (UNIX Communication Notes 36, UNIX MAGAZINE (1991.6), pp. 61-71, and UNIX Communication Notes 37, UNIX MAGAZINE (1991.7), pp. 69-81).

[0003] Network monitoring systems, typified by Hewlett-Packard's HP OpenView and Sun Microsystems' SunNet Manager, provide a mechanism for collecting dynamic information such as network traffic information. Hewlett-Packard's Location Broker is a name service that manages server locations network-wide for client/server applications that use RPC (Remote Procedure Call) (The Forefront of Distributed Computing, Nikkei Electronics, no. 502 (1990.6), pp. 137-138).

[0004] A conventional load balancing technology is Hewlett-Packard's Task Broker. When a processing request is made, the requesting client machine queries every server machine in the network for its operating status (The Forefront of Distributed Computing, Nikkei Electronics, no. 502 (1990.6), pp. 145-146).

[0005] An integrated distributed system management technology is DCE (Distributed Computing Environment) from OSF (Open Software Foundation) (The Full Picture of the Distributed Computing Environment DCE, Computer Today, no. 44 (1991.7), pp. 17-45).

[0006]

[Problems to Be Solved by the Invention] Managing a distributed system requires several elemental technologies: a naming service that manages distributed computing resource information, load balancing that uses the whole distributed system efficiently, and system redundancy in preparation for failures. Furthermore, to hide the complex distributed configuration from users and make the system appear to be a single computer, these elemental technologies must be integrated.

[0007] Sun Microsystems' NIS (Network Information Service), the conventional naming service, is limited to managing static management information such as user registration information, computer address information, and network configuration information. Load balancing and failure handling also require dynamic information, such as the operating status of each computer and of the network, but NIS does not manage such information.

[0008] Network monitoring systems typified by Hewlett-Packard's HP OpenView and Sun Microsystems' SunNet Manager provide a mechanism for collecting such dynamic information, but most of what they collect concerns the operating status of the network. Their standard specifications do not collect the more important information on the operating status of each computer (CPU load, memory usage, disk usage, process execution status, and so on).

[0009] With Hewlett-Packard's Location Broker, the user (client) need not be aware of where a server is located, but server selection does not take sufficient account of load balancing across the distributed system.

[0010] Task Broker, the conventional load balancing technology, is intended to balance the load of batch-type program execution and is not suited to client/server systems with many interactive user operations. In addition, each time a processing request is made, the requesting client machine must query every server machine in the distributed system for its operating status.

[0011] DCE (Distributed Computing Environment), the integrated distributed system management technology, integrates basic elemental technologies for distributed system management such as communication means, a naming service, and a security service, but by itself it cannot provide management that extends to load balancing and failure handling.

[0012]

[Means for Solving the Problems] The first invention is a distributed computer system management system using the client/server model for a plurality of computers connected by a network. It comprises program execution management means installed on the computers and having means for receiving a connection request from a client to a server, distributed system management control means that receives the connection request from the client via the program execution management means, and a distributed system resource information database. The distributed system management control means has server selection means for selecting the computer having the optimal server using information in the distributed system resource information database, and means for notifying the program execution management means of the selection result. The program execution management means has means for requesting the computer having that server to prepare for connection, including starting the server program.

[0013] The second invention is the distributed computer system management system described above, in which the distributed system management control means has a failure management table that registers the connection status of clients and servers across the whole distributed system, and the program execution management means has a failure management table for the computer it manages.

[0014] The third invention is the distributed computer system management system described above, in which a plurality of the distributed system management control means are provided in the distributed system. At startup, the program execution management means issues a connection request to all distributed system management control means in the distributed system, and it has means for determining, in a predetermined order, one distributed system management control means and one other distributed system management control means that serves as its backup in case of failure. These two distributed system management control means each hold a copy of the program information managed by the program execution management means.

[0015] The fourth invention is a distributed computer system management method in which a plurality of the distributed system resource information databases are provided in the distributed system. One database serves as the master and the remaining one or more serve as backups, each holding a copy of the master. When a data update request arrives, the master distributed system resource information database is updated first, and the request is then sent to the backup databases to update them. Each distributed system resource information database is given an ID number that is updated every time its data is updated. When the master fails, the backup that holds the latest ID number and responds first to the data update request is chosen as the new master, so that the service of the distributed system resource information database continues.

[0016]

[Operation and Embodiments] This section describes an example in which the program execution management means, the distributed system management control means, and the distributed system resource information database were implemented in C on UNIX. Sun Microsystems' ONC/RPC version 4 was used for communication between the constituent modules. The programs under management and control are client/server programs, including X Window programs. The X Window System has been adopted on a wide range of machines as the standard UNIX window system, so most existing application programs can be used with this system.

[0017] FIG. 1 shows how the distributed system management control means manages distributed computing resources. Its functions are divided into an information collection subsystem (the shaded portion of the figure) and a system control subsystem (the diagonally hatched portion). Using network status information and other data gathered by the information collection subsystem, the system control subsystem provides the user with an efficient computing environment. Conventionally, users had to specify the computer hosting a server in order to connect, and had to switch to a server on another computer when a failure occurred. Under the distributed system management control means, users can use the system without any awareness of server locations or failures.

[0018] Next, the information collection subsystem is described. Computing resource information comprises static information fixed in advance when the system is built, such as user information and computer placement information, and dynamic information that changes from moment to moment, such as computer load and disk usage. Most existing computing resource information services handle only static information. Here, computing resource information including dynamic information is managed centrally in the distributed system resource information database, with the aim of handling distributed system control, such as load balancing and automatic failure recovery, in an integrated way.

[0019] Dynamic information was collected with a network monitoring system based on the standard network management protocol SNMP (Simple Network Management Protocol). As shown in FIG. 1, agents placed on each computer gather information, which a network management station collects and displays. The information gathered by agents is defined in a standard way as the management information base (MIB, Management Information Base). However, the standard MIB alone is insufficient for distributed system control such as load balancing and automatic failure recovery, so the MIB was extended with definitions for information such as computer utilization, disk usage, and program operating status.

[0020] The dynamic information for the whole distributed system collected in this way by the network monitoring system is written periodically to the distributed system resource information database, as shown in FIG. 1. Existing computing resource information services are designed mainly for data lookup and cannot cope with frequent updates from a dynamic information collection subsystem. To address this, a client/server distributed system resource information database was built using dbm, the standard UNIX database system. The machine that hosts the database runs a distributed system resource information server that operates on the data directly and processes client requests to reference, add, modify, and delete data.
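The four request types handled by the distributed system resource information server can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the class name `ResourceInfoServer`, the operation names, the key format, and the JSON encoding are hypothetical, and an in-memory dictionary of byte strings stands in for the dbm file.

```python
import json


class ResourceInfoServer:
    """Hypothetical server that owns the store and serves four request types."""

    def __init__(self):
        # A dict of byte strings stands in for a dbm file,
        # which likewise maps byte-string keys to byte-string values.
        self._store = {}

    def handle(self, op, key, value=None):
        k = key.encode()
        if op == "reference":
            raw = self._store.get(k)
            return None if raw is None else json.loads(raw)
        if op == "add":
            if k in self._store:
                raise KeyError(f"{key} already exists")
            self._store[k] = json.dumps(value).encode()
        elif op == "modify":
            if k not in self._store:
                raise KeyError(f"{key} not found")
            self._store[k] = json.dumps(value).encode()
        elif op == "delete":
            del self._store[k]
        else:
            raise ValueError(f"unknown operation: {op}")
        return None
```

Because all requests pass through one server object, frequent writes from the collection side and reads from the control side are serialized in a single place, which is one plausible reading of why a client/server arrangement was chosen.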

[0021] Next, the system control subsystem is described. Each computer managed by the distributed system management control means runs program execution management means, which manages program startup and handles failures on that computer. From the viewpoint of users and application programs, the program execution management means is the interface to the distributed system management control means. In other words, a user or application program that knows only how to communicate with the program execution management means on its own computer can use the computing resources and control functions of the entire distributed system managed by the distributed system management control means.

[0022] The operation of the system control subsystem when a user program (client) requests a connection to a server is described with reference to FIG. 2. When the client on computer A requests a connection to a server from the program execution management means (1), the request is passed via the program execution management means to the distributed system management control means interface on computer B (2). Within the distributed system management control means, the access right authentication means authenticates the access right to the server (3), and the server selection means then selects the computer having the optimal server using the information in the distributed system resource information database at that moment (4). Next, the failure countermeasure means registers the connection in the failure management table, described later, in preparation for a server failure (5), and the client-side program execution management means is notified of the server selection result (6). The client-side program execution management means asks the server-side program execution management means on the selected computer C to prepare for the connection, for example by starting the server program (7), and once preparation is complete (8)(9), it notifies the client of the connection destination (10). The client then connects to the server and starts processing (11). The following paragraphs describe the key points of the technology behind each module in the distributed system management control means.
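The connection sequence (1) to (11) can be sketched as a small in-process simulation. This is only an illustration under stated assumptions: the class and method names (`Controller`, `ExecutionManager`, `request_connection`, and so on) are hypothetical, authentication and selection are reduced to trivial stand-ins with precomputed scores, and no real RPC takes place.

```python
from dataclasses import dataclass, field


@dataclass
class Controller:
    """Stand-in for the distributed system management control means (computer B)."""
    allowed: dict            # hypothetical: user -> set of permitted server names
    hosts: dict              # hypothetical: server name -> {host: score}
    failure_table: list = field(default_factory=list)

    def handle_request(self, user, client_host, server):
        # (3) authenticate the access right to the server
        if server not in self.allowed.get(user, set()):
            raise PermissionError(f"{user} may not use {server}")
        # (4) select the host with the best score (scores are assumed inputs)
        candidates = self.hosts[server]
        chosen = max(candidates, key=candidates.get)
        # (5) register the connection in the failure management table
        self.failure_table.append((client_host, chosen, server))
        # (6) report the selection result
        return chosen


@dataclass
class ExecutionManager:
    """Stand-in for the program execution management means on one computer."""
    host: str
    running: set = field(default_factory=set)

    def prepare(self, server):
        # (8)-(9) start the server program if needed and report readiness
        self.running.add(server)
        return True

    def request_connection(self, user, server, controller, managers):
        # (1)-(2) forward the client's request to the controller
        chosen = controller.handle_request(user, self.host, server)
        # (7) ask the execution manager on the chosen host to prepare
        if not managers[chosen].prepare(server):
            raise RuntimeError("server preparation failed")
        # (10) tell the client where to connect; step (11) is left to the client
        return chosen


managers = {h: ExecutionManager(h) for h in ("A", "C", "D")}
controller = Controller(
    allowed={"alice": {"cad"}},
    hosts={"cad": {"C": 0.9, "D": 0.4}},
)
destination = managers["A"].request_connection("alice", "cad", controller, managers)
print(destination)
```

The client only ever talks to the execution manager on its own host, which mirrors the point made earlier that the program execution management means is the sole interface the user program needs to know.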

[0023] First, access management for computing resources is described. The access right authentication means consists of an access management database for computing resources, such as server programs, managed by the distributed system management control means, and an interface for operating on it. The access management database has two parts: an access information database, which records pairs of a computing resource and the groups allowed to access it, and a user information database, which records pairs of a user and the security groups the user belongs to. When a connection request from a user program (client) to a server program arrives, the user information database is consulted first to find the security groups the user belongs to. The access information database is then consulted to find the computers that have a server program those security groups can access. After the computing resources available to the user have been authenticated in this way, a server is selected from among the resulting candidates. The access management database is part of the distributed system resource information database.
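The two-step lookup, user to security groups and then groups to accessible servers, can be sketched as follows. The table contents, the key layout, and the function name `candidate_hosts` are hypothetical; the text does not specify how the two databases are structured internally.

```python
# Hypothetical contents of the two access-management databases.
USER_DB = {            # user -> security groups the user belongs to
    "alice": {"design"},
    "bob": {"accounting"},
}
ACCESS_DB = {          # (server program, host) -> groups allowed to use it
    ("cad", "hostC"): {"design"},
    ("cad", "hostD"): {"design", "admin"},
    ("ledger", "hostE"): {"accounting"},
}


def candidate_hosts(user, server):
    """Return the hosts whose copy of `server` the user is allowed to use."""
    groups = USER_DB.get(user, set())           # step 1: look up the user's groups
    return sorted(
        host
        for (name, host), allowed in ACCESS_DB.items()
        if name == server and groups & allowed  # step 2: match groups to resources
    )
```

For example, `candidate_hosts("alice", "cad")` yields both hosts that offer the `cad` server to the `design` group, and server selection would then choose among them.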

[0024] The server selection algorithm is as follows. For each computer that has a candidate server, the server selection means calculates an evaluation value (E) using the empirical formula below and selects the computer with the largest evaluation value. The data used in the formula are gathered into the distributed system resource information database by the information collection subsystem.

E = k1 · Bload / Amips + k2 · Cfmem + k3 · Dactive

where
Amips: representative processing speed of the computer (MIPS value)
Bload: CPU load average over the past minute
Cfmem: amount of free memory in main memory
Dactive: whether the server is already running (1 or 0)
k1, k2, k3: coefficients
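The evaluation formula E = k1 · Bload / Amips + k2 · Cfmem + k3 · Dactive can be tried out with a short function. The text gives no values or signs for the coefficients, so the defaults below are assumptions chosen only for illustration; in particular k1 is assumed negative so that a higher load lowers the score. The host names, measurements, and memory unit are likewise hypothetical.

```python
def evaluation(a_mips, b_load, c_fmem, d_active, k1=-1.0, k2=0.01, k3=0.5):
    """E = k1*Bload/Amips + k2*Cfmem + k3*Dactive (coefficient values are assumed)."""
    return k1 * b_load / a_mips + k2 * c_fmem + k3 * d_active


def select_host(candidates):
    """Return the name of the candidate with the largest evaluation value."""
    return max(candidates, key=lambda name: evaluation(**candidates[name]))


# Hypothetical measurements: MIPS rating, 1-minute load average,
# free memory (MB here, an assumed unit), and whether the server is running.
candidates = {
    "hostC": dict(a_mips=50.0, b_load=2.0, c_fmem=16.0, d_active=1),
    "hostD": dict(a_mips=25.0, b_load=0.5, c_fmem=32.0, d_active=0),
}
print(select_host(candidates))
```

With these assumed coefficients the already-running server on `hostC` outweighs the larger free memory on `hostD`; different coefficients would shift that balance, which is presumably why the formula is described as empirical.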

[0025] Failure handling is described next. In preparation for failures of managed programs, the distributed system management control means keeps, in the distributed system resource information database, a list of client-server connections for the whole distributed system (the failure management table). Each program execution management means also keeps a partial failure management table covering only the computer it manages. The failure countermeasure means manages client-server connections in the distributed system by operating on these tables.

[0026] Failure handling involves three tasks: deleting the relevant entry from the failure management table, registering the failure in the failure history list, and scheduling a new server. FIG. 3 shows this operation. When the server-side program execution management means detects a failure of a managed program (1), it notifies the distributed system management control means interface of the failure (2) and deletes the entry from its own failure management table (3). Within the distributed system management control means, the failure countermeasure means deletes the related entry from the failure management table in the distributed system resource information database (4) and registers an entry in the failure history list, which is used when a server is rescheduled (5). On the client side, the client detects the failure because the server stops responding (6) and reports it to the client-side program execution management means (7), which deletes the entry from its failure management table (8) and requests reconnection to a server (9)(10). Whether to reconnect to a new server after a failure is decided by the client requesting the connection. Processing from this point on is the same as normal server selection.
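The table operations performed on a server failure can be sketched with in-memory structures. The class `FailureTables`, its fields, and the entry layout are hypothetical stand-ins; how the failure history list is consulted during rescheduling is not spelled out in the text, so `excluded_hosts` shows just one assumed use.

```python
from dataclasses import dataclass, field


@dataclass
class FailureTables:
    """Hypothetical in-memory stand-ins for the tables named in the text."""
    global_table: set = field(default_factory=set)    # (client host, server host, program)
    local_tables: dict = field(default_factory=dict)  # host -> set of the same tuples
    history: list = field(default_factory=list)       # failure history list

    def register(self, client_host, server_host, program):
        entry = (client_host, server_host, program)
        self.global_table.add(entry)
        for host in (client_host, server_host):
            self.local_tables.setdefault(host, set()).add(entry)

    def server_failed(self, server_host, program):
        """Drop matching entries (3)(4)(8) and record the failure (5)."""
        failed = {e for e in self.global_table if e[1:] == (server_host, program)}
        self.global_table -= failed
        for table in self.local_tables.values():
            table -= failed
        self.history.append((server_host, program))
        # Clients that may now ask for reconnection, steps (9)-(10).
        return sorted(e[0] for e in failed)

    def excluded_hosts(self, program):
        """Hosts to avoid when rescheduling (an assumed use of the history list)."""
        return {h for h, p in self.history if p == program}
```

After `server_failed` returns, each affected client can choose whether to request a new connection, at which point the normal selection path runs again with the failed host optionally excluded.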

[0027] Next, the multiplexing of the distributed system management control means and the associated failure handling are described. The distributed system management control means was multiplexed to improve reliability and, by localizing processing, responsiveness. If processing concentrates on a single distributed system management control means, it becomes a bottleneck and the efficiency of the whole system falls. To solve this, multiple distributed system management control means are deployed and share the requests from the program execution management means of each computer.

[0028] This was implemented as shown in FIG. 4. At startup, the program execution management means issues a connection request to every distributed system management control means in the distributed system. In order of response, fastest first, it chooses the distributed system management control means that will look after it (hereinafter the binder) and the one that will serve as backup if the binder fails (hereinafter the safety). Because this choice is entirely arbitrary, a given distributed system management control means may act as binder for one program execution management means and as safety for another. The binder and the safety both hold copies of the program information managed by the program execution management means they serve, so even if the binder fails, switching to the safety allows requests from the program execution management means to be processed without inconsistency.
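Choosing the binder and the safety by response order can be sketched as below. The function names and the representation of responses as a name-to-arrival-time mapping are hypothetical; the text only states that the two roles are assigned in order of response.

```python
def choose_binder_and_safety(response_times):
    """Pick the two fastest responders among the controllers.

    `response_times` maps a controller name to the time (in seconds) at which
    its reply arrived, or to None if it did not reply. Both the mapping and
    the names are hypothetical.
    """
    replied = sorted(
        (t, name) for name, t in response_times.items() if t is not None
    )
    if len(replied) < 2:
        raise RuntimeError("need at least two responding controllers")
    binder, safety = replied[0][1], replied[1][1]
    return binder, safety


def active_controller(binder, safety, alive):
    """Use the binder while it is alive, otherwise fall back to the safety."""
    return binder if binder in alive else safety
```

Selecting by response time tends to bind each execution manager to a nearby or lightly loaded controller, which fits the stated aim of localizing processing.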

[0029] Finally, the multiplexing of the distributed system resource information database and the associated failure handling are described. Each distributed system resource information database, multiplexed for fault tolerance, operates as either a master or a slave depending on the situation. Exactly one master exists in a managed distributed system, and only the master can accept updates. At least one slave exists; a slave can be read, but it cannot be updated by anyone other than the master. When a data update request arrives, the master updates its own database and then sends the request to all slaves to update their replica databases.

[0030] To preserve consistency among the distributed system resource information databases, which is the main difficulty with multiplexing, each database carries an identification number (hereinafter the ID) that is updated every time its data changes. When the master updates data, it checks the ID held by each slave database and applies the update to that slave only if the ID matches the master's. If the IDs do not match, the master's entire database is copied to the slave so that no inconsistency arises between replicas.
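The ID check during replication can be sketched as follows. Several details are assumptions: the ID is modeled as a simple counter, the comparison is made against the master's ID from before the current update, and the names `Replica` and `apply_update` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    data: dict = field(default_factory=dict)
    version: int = 0          # the ID that changes on every data update


def apply_update(master, slaves, key, value):
    """Update the master, then bring every slave to the same state."""
    previous = master.version
    master.data[key] = value
    master.version += 1       # assumed: the ID is a simple counter
    for slave in slaves:
        if slave.version == previous:
            # IDs matched before the update: apply the same change.
            slave.data[key] = value
        else:
            # IDs differed: copy the master's whole database instead.
            slave.data = dict(master.data)
        slave.version = master.version
```

A slave that missed earlier updates is therefore repaired by a full copy rather than by replaying individual changes, trading bandwidth for a simple consistency rule.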

[0031] FIG. 5 shows automatic failure recovery when the master fails. Initially, the master runs on computer A in FIG. 5. A master failure is detected by a client that issues a request to the master distributed system resource information database. If the master does not respond, the client that issued the request concludes that a failure has occurred (1) and broadcasts a message to find the slaves (2). Among the slaves that respond, the one that holds the latest ID number and answered the broadcast soonest, here the slave on computer B, is chosen as the new master (3), and the data update request is issued to it (4). Finally, the new master on computer B replicates the updated data to the remaining slaves (5), and the service of the distributed system resource information database continues.
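The choice of a new master from the broadcast replies can be sketched in a few lines. The function name and the reply format are hypothetical, and "latest ID" is taken to mean the numerically largest ID, which is an assumption about how IDs are ordered.

```python
def elect_new_master(replies):
    """Choose the new master from the slaves' replies to the broadcast.

    `replies` is a list of (slave name, ID number) pairs in arrival order.
    "Latest" is taken to mean the numerically largest ID, which is an assumption.
    """
    if not replies:
        raise RuntimeError("no slave answered the broadcast")
    latest = max(version for _, version in replies)
    # The first reply carrying the latest ID wins.
    return next(name for name, version in replies if version == latest)
```

Note that a slave answering first but holding a stale ID is passed over, so freshness takes priority and response speed only breaks ties.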

【００３２】[0032]

【発明の効果】この発明ではプログラム実行管理手段
と、分散システム資源情報データベースの情報を用いて
最適なサーバを持つ計算機を選択するサーバ選択手段を
備えた分散システム管理制御手段により、サーバの位置
を意識せずにプログラムを実行することができ、更に、
サーバの選択時にシステムの負荷分散を考慮することが
可能となった。According to the present invention, the program execution management means, together with the distributed system management control means having a server selection means that selects the computer with the optimum server by using the information in the distributed system resource information database, makes it possible to execute a program without being aware of the server's location. Furthermore, system load distribution can be taken into account when a server is selected.
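As one way to picture load-aware server selection, the following sketch picks the least loaded computer that offers the requested server. The `ComputerInfo` record, its `load` and `up` fields, and the "lowest load wins" criterion are assumptions made for illustration; the specification only says that the optimum server is chosen from the database information.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ComputerInfo:
    """One entry of the resource information database (hypothetical layout)."""
    host: str
    servers: frozenset   # names of the server programs this computer can run
    load: float          # current load figure; lower is better (assumption)
    up: bool = True      # False if the computer is known to be down


def select_server_host(resources: list[ComputerInfo], server_name: str) -> Optional[str]:
    """Return the host of the least loaded running computer that offers the server."""
    candidates = [c for c in resources if c.up and server_name in c.servers]
    if not candidates:
        return None  # no computer can currently provide this server
    return min(candidates, key=lambda c: c.load).host
```

The client never names a host in this sketch; it asks only for a server by name, which mirrors the location transparency described above.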

【００３３】また、分散システム管理制御手段が、分散
システム全体のクライアントとサーバの接続状況を登録
する障害管理テーブルを備えていると同時に、プログラ
ム実行管理手段が自己の管理する計算機に関する障害管
理テーブルを備えていることにより、障害発生時にも自
動的に障害を検知し、別サーバを選択して切替えること
ができるので、ユーザは障害の発生を意識することなく
作業を進めることが可能となった。Further, because the distributed system management control means has a failure management table that registers the connection status of clients and servers across the entire distributed system, and the program execution management means has a failure management table for the computers it manages, a failure is detected automatically when it occurs and another server is selected and switched to. The user can therefore continue working without being aware of the failure.
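A minimal sketch of how a table of client/server connections could drive the switchover is shown below. The `Connection` rows, the `providers` mapping, and the "first remaining provider" choice are all hypothetical; the specification does not define the table layout or the replacement policy at this point.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Connection:
    """One row of a failure management table (hypothetical layout)."""
    client: str
    server_name: str
    host: str  # computer currently running the server for this client


def switch_over(table: list[Connection], failed_host: str,
                providers: dict[str, list[str]]) -> list[Connection]:
    """Re-point every connection on failed_host to another provider.

    Returns the rows that were moved; rows with no alternative stay unchanged.
    """
    moved = []
    for conn in table:
        if conn.host != failed_host:
            continue  # this connection is not affected by the failure
        alternatives = [h for h in providers.get(conn.server_name, [])
                        if h != failed_host]
        if alternatives:
            conn.host = alternatives[0]  # simplest policy: first remaining provider
            moved.append(conn)
    return moved
```

Because the table records which client is attached to which server, only the connections on the failed computer need to be touched.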

【００３４】プログラム実行管理手段が、１つの分散シ
ステム管理制御手段と、その分散システム管理制御手段
の障害時の予備となる他の１つの分散システム管理制御
手段を決定する手段を備えていることにより、分散シス
テム管理制御手段の障害時にも、個々のプログラム実行
に対して遅滞なく処理を継続することが可能となった。Because the program execution management means has means for determining one distributed system management control means and one other distributed system management control means that serves as a backup when the former fails, processing for individual program executions can continue without delay even when a distributed system management control means fails.

【００３５】また、分散システム資源情報データベース
も、１つの分散システム資源情報データベースと、その
分散システム資源情報データベース障害時の予備となる
他の１つ以上の分散システム資源情報データベースを備
えていることにより、分散システム資源情報データベー
スの障害時にも、予備の分散システム資源情報データベ
ースに切替えて分散システム資源情報のサービスを自動
的に継続することができ、個々のプログラム実行に対し
て遅滞なく処理を継続することが可能となった。Similarly, because the distributed system resource information database consists of one distributed system resource information database and one or more others that serve as backups in case it fails, the service can be switched automatically to a backup database when the distributed system resource information database fails. The distributed system resource information service thus continues, and processing for individual program executions proceeds without delay.

[Brief description of drawings]

【図１】本システムの全体構成を示す。FIG. 1 shows the overall configuration of this system.

【図２】分散システム管理制御手段の管理下で、クライ
アントからサーバへの接続要求を処理する動作を示す。FIG. 2 shows the operation of processing a connection request from a client to a server under the management of the distributed system management control means.

【図３】分散システム管理制御手段の管理下で、サーバ
の障害発生時にサーバの自動切替えを行う障害対策の動
作を示す。FIG. 3 shows the failure countermeasure operation in which, under the management of the distributed system management control means, the server is switched automatically when a server failure occurs.

【図４】分散システム管理制御手段の多重化方法を示
す。FIG. 4 shows a method of multiplexing distributed system management control means.

【図５】分散システム資源情報データベースの自動障害
回復方法を示す。FIG. 5 shows an automatic failure recovery method for a distributed system resource information database.

フロントページの続き (72)発明者岡原弘明東京都千代田区丸の内一丁目１番２号日本鋼管株式会社内 (72)発明者中村一成東京都千代田区丸の内一丁目１番２号日本鋼管株式会社内 Continuation of the front page: (72) Inventor Hiroaki Okahara, c/o Nippon Kokan K.K., 1-1-2 Marunouchi, Chiyoda-ku, Tokyo; (72) Inventor Kazunari Nakamura, c/o Nippon Kokan K.K., 1-1-2 Marunouchi, Chiyoda-ku, Tokyo.

Claims

[Claims]

1. A distributed computer system management system for a client-server system of a plurality of computers connected by a network, comprising: program execution management means installed in a computer and having means for receiving a connection request from a client to a server; distributed system management control means for receiving the connection request from the client via the program execution management means; and a distributed system resource information database, wherein the distributed system management control means comprises server selection means for selecting a computer having an optimum server by using information in the distributed system resource information database, and means for notifying the program execution management means of the selection result of the server selection means, and wherein the program execution management means comprises means for requesting the computer having the server to make connection preparations including starting the server program.

2. The distributed computer system management system according to claim 1, wherein the distributed system management control means comprises a failure management table for registering the connection status of clients and servers in the entire distributed system, and the program execution management means comprises a failure management table for the computer that it manages.

3. The distributed computer system management system according to claim 1, wherein a plurality of distributed system management control means are provided in the distributed system; the program execution management means comprises means for issuing, at startup, a connection request to all distributed system management control means in the distributed system and for determining, in accordance with a predetermined order, one distributed system management control means and one other distributed system management control means to serve as a backup when the former fails; and the two distributed system management control means each hold a copy of the program information managed by the program execution management means.

4. A method for managing a distributed computer system, characterized in that: a plurality of distributed system resource information databases are provided in a distributed system, one serving as the master distributed system resource information database and the remaining one or more serving as spares for the master; when a data update request is made, the master distributed system resource information database is updated and a request is then sent to the spare distributed system resource information databases to update them; each distributed system resource information database is assigned an ID number that is updated every time its data is updated; and, when a failure occurs in the master distributed system resource information database, the spare distributed system resource information database that has the latest ID number and that responded first to the data update request is selected as the new master distributed system resource information database, so that the service processing of the distributed system resource information database is continued.