JP2009128987A

JP2009128987A - Computer management system, computer management method and computer management control program

Info

Publication number: JP2009128987A
Application number: JP2007300386A
Authority: JP
Inventors: Takahiro Sokogawa; 貴裕曽小川; Hirotatsu Osaki; 寛達大崎; Yoshifumi Kokado; 能史小角; Takahisa Iwama; 隆寿岩間; Hironobu Sugata; 宏順須賀田
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2007-11-20
Filing date: 2007-11-20
Publication date: 2009-06-11

Abstract

PROBLEM TO BE SOLVED: To provide a computer management system, a computer management method, and a computer management control program that allow a quick switch to a new manager for managing agents when the manager of a group of computers connected to a network fails.

SOLUTION: Before the manager 201 fails, a management node determination section 211 of the manager 201 determines which of the agents 202₁ and 202₂ can serve as a manager, and the result is stored in a management node storage section 227 of each agent. When a manager monitor section 221 detects a failure in the manager 201, a management request section 225 requests the stored agent 202 to serve as manager. Consequently, the manager 201 can be switched quickly.

COPYRIGHT: (C)2009,JPO&INPIT

Description

The present invention relates to a computer management system, a computer management method, and a computer management control program for managing a plurality of computers connected to a network, and in particular to ones suited to the case where a failure occurs in the managing computer.

Many devices with built-in processors, such as personal computers, servers, facsimile machines, mobile phones, and PDAs (Personal Digital Assistants) (hereinafter simply called computers), are now connected to networks such as the Internet and LANs (Local Area Networks). Because these computers are networked, one computer within a particular group can manage the other computers in that group.

In this specification, a managing computer is called a manager and a managed computer is called an agent. Suppose, for example, that a plurality of personal computers are connected via the Internet as one group, and that one of them, or one server placed in the same group, is the manager. The manager then monitors the agents, that is, the other personal computers in the group. If a failure occurs in one of them, the manager can notify the user or, if the settings of that personal computer permit, carry out appropriate processing such as a restart to recover from the failure.

With such a computer management system, information about every agent is concentrated in the manager, which has the advantage that the manager can easily grasp the state of the entire system. However, when the system management functions are concentrated in the manager, no node remains to manage the agents if the manager itself fails or if a network failure cuts off communication between the manager and the agents. As a result, such a failure significantly lowers the reliability of the computer management system itself.

For this reason, duplicating the node that serves as manager has been proposed as a first related technique (see, for example, Patent Document 1). A second related technique duplicates the manager node and, when one of the duplicated nodes fails, promotes an agent to manager so that the manager node always remains duplicated (see, for example, Patent Document 2).

In the first and second related techniques, the manager node is duplicated into an active system and a standby system. However, the possibility that both duplicated manager nodes fail at the same time cannot be ruled out, and when that happens these techniques cannot maintain continuity of communication. There are also computer management systems in which resource constraints make it impossible to duplicate the manager node.

A third related technique has therefore been proposed in which network monitoring servers and a management server are provided in addition to a client terminal management server that manages a plurality of client terminals (see, for example, Patent Document 3). This third related technique was devised to reduce the load on the management server, and registers the addresses of a plurality of network monitoring servers in the client terminal management server. Thus, even if one of the network monitoring servers fails, the client terminal management server can access another network monitoring server and obtain the address of the desired network monitoring server.

Patent Document 1: Japanese Patent Laid-Open No. 06-197112 (paragraph 0013, FIG. 1)
Patent Document 2: Japanese Patent Laid-Open No. 2006-235837 (paragraph 0010, FIG. 1)
Patent Document 3: Japanese Patent Laid-Open No. 2003-256303 (paragraphs 0029, 0045, and 0060, FIG. 4)

However, the third related technique does not let a network monitoring server take over management of the client terminals when the client terminal management server fails. Because the network monitoring servers are connected through the single client terminal management server, management of the client terminals becomes impossible the moment that server fails.

The third related technique also has a three-tier server organization consisting of the client terminal management server, the network monitoring servers, and the management server, and is characterized in that the network monitoring servers in the middle tier hold a common server list. If all of the network monitoring servers fail, the client terminal management server can no longer access the list at all.

An object of the present invention is therefore to provide a computer management system, a computer management method, and a computer management control program that make it possible to switch quickly to a new manager for managing the agents when a failure occurs on the manager side in a group of computers connected to a network.

In the present invention, a computer management system comprises (a) an arbitrary number of managed computers, and (b) a managing computer that is connected to each of these managed computers via a network and that includes failure handling determination means for determining, before a failure occurs in the management of the managed computers, the content of the failure handling processing that the managed computers are to execute when the failure occurs.

Also in the present invention, a computer management method comprises (a) a managing computer determination step of communicating, via a network, with an arbitrary number of computers placed under the management of the own device and determining whether a computer is suited to managing each of these computers in place of the own device.

Further, in the present invention, a computer management control program causes a managing computer, which is connected via a network to an arbitrary number of managed computers and manages them, to execute (a) determination processing that determines, repeatedly at intervals, the processing that the managed computers are to execute when a failure occurs in the own device, and (b) notification processing that notifies the managed computers of the result of the determination processing.

As described above, according to the present invention, the managing computer communicates via the network with an arbitrary number of computers under its management and determines the content of the failure handling processing that the managed computers are to execute. Because the managed computers thereby obtain the determination of which computer is the new manager candidate, they can quickly restore the reliability of the system among themselves when a failure occurs. The system can also shift to one in which the managed computers manage one another in an autonomous, distributed manner.

Next, the best mode for carrying out the present invention will be described in detail.

FIG. 1 shows the basic configuration of the computer management system of this embodiment. The computer management system 100 of this embodiment consists of a specific computer or group of computers (hereinafter called a manager) 101 and first and second agents 102₁ and 102₂, which are other computers (hereinafter called agents) managed by the manager 101. The manager 101 and the first and second agents 102₁ and 102₂ (the number of agents may be any positive integer) are connected by a network 103.

The manager 101 contains a failure handling determination unit 111 and a failure handling notification unit 112 as the processing parts that deal with the occurrence of a failure. The failure handling determination unit 111 determines how the first and second agents 102₁ and 102₂ are to be managed, and does so while the manager 101 has not yet failed. Specifically, while the current manager 101 is alive, it determines in advance, for each agent, the optimal other agent to which that agent should newly entrust its management, as its new manager, when the manager 101 fails. The failure handling notification unit 112 sends the first and second agents 102₁ and 102₂, through the network 103, a notification 114 of the determination content 113 that the failure handling determination unit 111 has produced for the event of a failure of the manager 101.

With this configuration, when a failure actually occurs in the manager 101, each of the agents 102₁ and 102₂ can immediately request management from the other agent (not shown) that was determined in advance. The system can thus move quickly from management by the manager 101 over the first and second agents 102₁ and 102₂ to autonomous, distributed management among the agents themselves, so the reliability of the system can be restored quickly after a manager failure.

FIG. 2 shows the configuration of the computer management system of this embodiment in concrete form. Parts in FIG. 2 that are the same as in FIG. 1 are given the same reference numerals, and their description is omitted as appropriate. In this computer management system 200, a manager 201 and first and second agents 202₁ and 202₂ are connected by a network 203.

The manager 201 includes a management node determination unit 211, which corresponds to the failure handling determination unit 111 in FIG. 1, and a management node notification unit 212, which corresponds to the failure handling notification unit 112. The management node determination unit 211 is a processing unit that, in preparation for a failure of the manager 201, determines the candidate for the optimal management node that will manage each agent, here the first and second agents 202₁ and 202₂, in the manager's place. In other words, the optimal management node determined by the management node determination unit 211 is not the optimal node for a given agent to manage, but the optimal node to manage that agent. The management node notification unit 212 is a processing unit that transmits the decisions of the management node determination unit 211 to the first and second agents 202₁ and 202₂. Although this figure shows the first and second agents 202₁ and 202₂, the number of agents constituting the computer management system 200 is not limited to two.

The first and second agents 202₁ and 202₂ have the same configuration. The description therefore focuses on the first agent 202₁, and the description of the second agent 202₂ is omitted as appropriate. For the second agent 202₂, the subscript "1" on the reference numerals used for the components of the first agent 202₁ is replaced with "2".

The first agent 202₁ includes a manager monitoring unit 221₁ that monitors the manager 201 and a management node receiving unit 223₁ that receives the management node information 222₁ notified by the manager 201. The monitoring result information 224₁ of the manager monitoring unit 221₁ is input to a management request unit 225₁. When the management request unit 225₁ detects from the monitoring result that the manager 201 has failed, it issues a management request 226₁ for its own node to another node, here the second agent 202₂.

The first agent 202₁ also includes a management node storage unit 227₁ and a management determination unit 228₁. The management node storage unit 227₁ stores the received information 229₁ that the management node receiving unit 223₁ has received. The management determination unit 228₁ evaluates a management request 226₂ when one arrives from the management request unit 225₂ of another node, here the second agent 202₂.

Since the manager 201 and the first and second agents 202₁ and 202₂ described above are all computers, they naturally include a CPU (Central Processing Unit) and a storage medium. The storage medium stores the control program with which the CPU implements the computer management system 200. Likewise, a computer that, as another node, takes over the manager's role when the manager 201 fails naturally also has a CPU and a storage medium storing the relevant control program.

FIGS. 3 and 4 outline the system operations that make it possible to hand over management from the manager in the computer management system configured as above. FIG. 3 shows the advance preparation for a failure of the current manager 201 shown in FIG. 2, up to the point where the first and second agents 202₁ and 202₂ store the information on the optimal other node to which each should entrust its management. FIG. 4 shows the operation at the time of a failure, in which, when the current manager 201 fails, each agent requests another agent to manage it.

First, the advance preparation by which the computer management system 200 as a whole readies itself for a failure of the manager 201 is described using FIG. 3 together with FIG. 2. The manager 201 first uses the management node determination unit 211 to determine, for each of the first and second agents 202₁ and 202₂, the optimal management node other than itself for the case where the manager fails (step S301). This determination is made when the computer management system 200 first starts up and whenever the configuration of the agents 202, such as the first and second agents 202₁ and 202₂, changes, because a change in the agent configuration may change the optimal management node. The manager 201 may also repeat this processing periodically until a failure occurs.

The manager 201 sends this determination 231 from its management node notification unit 212 over the network 203 as management node information 222₁ and 222₂, notifying the first and second agents 202₁ and 202₂ (step S302). This notification of the optimal management node (candidate) is received by the management node receiving units 223₁ and 223₂ in the first and second agents 202₁ and 202₂ (step S303). The received information 229₁ and 229₂ is stored in the corresponding management node storage units 227₁ and 227₂ (step S304).

As already explained, when the processing from step S301 of FIG. 3 is performed again at a given timing, either because the configuration of the agents 202 has changed or as periodic processing, the received information 229₁ and 229₂ on the optimal management node (candidate) overwrites the previous contents of the management node storage units 227₁ and 227₂.
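The preparation sequence of steps S301 to S304 can be sketched as follows. This is only an illustration under stated assumptions, not the implementation described in the specification: the class names `Manager` and `Agent`, the `load` field, and the rule of choosing the peer with the lowest load are all hypothetical, since the text at this point does not fix how the optimal node is chosen.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Agent:
    """Hypothetical agent with a management node storage unit (227)."""
    name: str
    load: float  # illustrative metric only (assumption)
    stored_candidate: Optional[str] = None  # contents of the storage unit

    def receive_management_node(self, candidate: str) -> None:
        # S303: receive the notification.
        # S304: store it, overwriting any earlier candidate.
        self.stored_candidate = candidate


@dataclass
class Manager:
    """Hypothetical manager with determination (211) and notification (212) units."""
    agents: Dict[str, Agent] = field(default_factory=dict)

    def determine(self) -> Dict[str, Optional[str]]:
        # S301: for each agent, pick a node other than that agent itself.
        # Assumed rule: the peer with the lowest load.
        decision: Dict[str, Optional[str]] = {}
        for name in self.agents:
            peers = [a for a in self.agents.values() if a.name != name]
            decision[name] = min(peers, key=lambda a: a.load).name if peers else None
        return decision

    def notify(self, decision: Dict[str, Optional[str]]) -> None:
        # S302: send each agent its own candidate.
        for name, candidate in decision.items():
            if candidate is not None:
                self.agents[name].receive_management_node(candidate)
```

Calling `notify(determine())` a second time after the loads change replaces each stored candidate, which mirrors the overwrite behaviour of the management node storage units.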

Next, the processing of FIG. 4 is described together with FIG. 2. The manager monitoring units 221₁ and 221₂ in the first and second agents 202₁ and 202₂ shown in FIG. 2 constantly monitor whether a failure has occurred in the manager 201 (step S321). When a failure occurs in the manager 201 (Y), the management request units 225₁ and 225₂ of the first and second agents 202₁ and 202₂ read, from their own management node storage units 227₁ and 227₂, the node information 233₁ and 233₂ identifying the node to which management of the own node is to be entrusted (step S322).

If the node information 233₁ and 233₂ is obtained successfully (step S323: Y), the management request units 225₁ and 225₂ issue management requests 226₁ and 226₂ to the nodes so obtained (step S324). The management determination unit 228 of the device that receives the request, which at this point is still an agent 202, then determines whether it will take on the management (step S325). If that management determination unit 228 determines (or agrees) that it can manage the requester (step S326: Y), it notifies the requesting side of this result, the agent 202 becomes the new manager 201 (step S327), and the series of processing ends (END).

Suppose, for example, that the management node storage unit 227₁ records that the second agent 202₂ is the optimal manager 201 for the first agent 202₁, and that the management determination unit 228₂ of the second agent 202₂ accepts the request. In this case, after the first agent 202₁ detects the failure of the manager 201, the second agent 202₂ becomes the manager 201 of the first agent 202₁ (step S327), and the handover of the manager 201 is completed quickly.

By contrast, there are cases in step S323 where no optimal management node candidate is stored in the management node storage unit 227₁ (or the management node storage unit 227₂) and none can be obtained (step S323: N). In such a case, the agent searches for a management node to become the manager 201 using a conventional, general-purpose method (step S328). The same applies when the candidate agent 202 determines that it cannot take on the management (step S326: N). In these cases, management may be unavailable for a relatively long time until a new manager 201 is settled.
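The branch structure of steps S322 to S328 on the agent side can be summarized in a short sketch. The function name and its three parameters are hypothetical; the specification does not define how the request is transported or what the conventional search in step S328 consists of, so both are passed in here as callables.

```python
from typing import Callable, Optional


def handle_manager_failure(
    stored_candidate: Optional[str],
    ask_candidate: Callable[[str], bool],
    fallback_search: Callable[[], str],
) -> str:
    """Return the node that becomes this agent's new manager.

    stored_candidate: value read from the management node storage unit (S322).
    ask_candidate:    sends the management request and returns the candidate's
                      accept/reject decision (S324 to S326); hypothetical hook.
    fallback_search:  conventional search for a manager (S328); hypothetical hook.
    """
    # S323: N -> no candidate was stored, fall back to the conventional search.
    if stored_candidate is None:
        return fallback_search()
    # S324 to S326: ask the candidate whether it can take on the management.
    if ask_candidate(stored_candidate):
        return stored_candidate  # S327: the candidate becomes the new manager
    # S326: N -> the candidate declined, fall back as well.
    return fallback_search()
```

Only the first path avoids the search, which is why the pre-stored candidate shortens the time without a manager.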

Thus, according to this embodiment, using the information on the optimal management node that has already been determined allows a quick switch from manager-led management to autonomous, distributed management when the manager 201 fails, which shortens the period during which system reliability is degraded. In addition, even after the manager 201 has chosen a candidate for new manager, the agent 202 chosen as candidate itself determines (or agrees) whether to become manager, so the manager role is not forced on a node that cannot carry it.

FIG. 5 shows the configuration of a computer management system in one example of the present invention. As this simplified figure shows, in the computer management system 400 of this example, first to third nodes 401 to 403 are connected to one another by a common network 404 such as the Internet.

FIG. 6 shows the configuration common to the nodes of this example. A computer 411, a node that can act as either manager or agent, includes a manager unit 412 that functions as a manager, an agent unit 413 that functions as an agent, a function switching unit 414 that switches between these functions, and a network communication unit 415 that communicates with the network 404 shown in FIG. 5. The computer 411 acts as a manager when the function switching unit 414 activates the manager unit 412, and as an agent when the function switching unit 414 activates the agent unit 413.

The manager unit 412 consists of the management node determination unit 211 and the management node notification unit 212 shown in FIG. 2. The management node notification unit 212 is connected to the network communication unit 415 via the function switching unit 414. Because the management node determination unit 211 and the management node notification unit 212 have already been described in detail, their description is omitted here.

The agent unit 413 includes an agent communication unit 421 connected to the network communication unit 415 via the function switching unit 414. The agent communication unit 421 is the part that communicates with the network communication unit 415, and is connected to the management node receiving unit 223, the manager monitoring unit 221, the management request unit 225, and the management determination unit 228. The management node storage unit 227 occupies part of a nonvolatile memory (not shown) in the computer 411 and is connected to the manager monitoring unit 221 and the management request unit 225. The manager monitoring unit 221 and the management request unit 225 are also directly connected to each other. Apart from the agent communication unit 421, the units in the agent unit 413 are the same as the corresponding processing units shown in FIG. 2, so their detailed description is omitted.

The function switching unit 414 is connected to a function sharing table 416 held in another area of the nonvolatile memory mentioned above. According to the contents of this table, it is set so that, in relation to a given counterpart node, either the manager unit 412 or the agent unit 413 functions.

FIG. 7 shows the contents of the function sharing table in the initial state, before any failure occurs. This point in time is called the first time t₁. At the first time t₁, the assignment of managed nodes and managing nodes in the function sharing table 416 has been made by, for example, the administrator of the computer management system 400 shown in FIG. 5. The description refers to FIG. 5 as well.

In the function sharing table 416, a managed node is a node acting as an agent and a managing node is a node acting as a manager. In this example, between the first node 401 and the third node 403, the former is the agent and the latter is the manager. Between the second node 402 and the first node 401, the former is the agent and the latter is the manager. Between the third node 403 and the second node 402, the former is the agent and the latter is the manager.
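The three assignments at time t₁ can be written as a small lookup table, together with the decision the function switching unit 414 has to make for a given counterpart node. The dictionary layout and the function name `role_toward` are assumptions made for illustration; the specification does not state how the table 416 is encoded.

```python
from typing import Dict

# Function sharing table at time t1, encoded (as an assumption) as
# managed node (agent) -> managing node (manager).
FUNCTION_SHARING_TABLE: Dict[int, int] = {
    401: 403,  # node 401 is managed by node 403
    402: 401,  # node 402 is managed by node 401
    403: 402,  # node 403 is managed by node 402
}


def role_toward(self_node: int, peer: int,
                table: Dict[int, int] = FUNCTION_SHARING_TABLE) -> str:
    """Return which unit should function on self_node in relation to peer."""
    if table.get(peer) == self_node:
        return "manager"  # self_node manages peer: manager unit 412 functions
    if table.get(self_node) == peer:
        return "agent"    # peer manages self_node: agent unit 413 functions
    return "none"         # no management relationship in the table
```

With this table, node 403 acts as manager toward node 401 and as agent toward node 402, so the same computer holds both roles at once, each relative to a different counterpart.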

As an example, consider the relationship between the first node 401 and the third node 403 in detail. Assume that the third node 403, acting as manager, fails at a second time t₂ later than the first time t₁, and that at some point before the second time t₂ the third node 403 determines that the second node 402 is the optimal node to manage the first node 401 as an agent.

FIG. 8 shows the processing of a node that, in relation to a counterpart node, is the manager on the managing side. The description refers to FIGS. 5 and 6 as well. Because the function sharing table 416 gives the third node 403 the manager function in relation to the first node 401, the third node 403 first determines the node that can take over the manager role toward the first node 401, as agent, if the third node 403 itself fails (step S501).

First, how this node is determined is described. The management node determination unit 211 of the third node 403, acting as manager, can communicate as needed with each agent, such as the first node 401, and collect various kinds of information. Based on the information obtained in this way, such as each agent's CPU load and memory usage, the management node determination unit 211 of the third node 403 determines the node best suited to manage the first node 401 if the third node itself fails.
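The text names CPU load and memory usage as inputs but does not give a selection rule. The sketch below is one possible rule, stated as an assumption: it discards agents above a load limit and picks the one with the smallest combined load. The names `AgentStatus` and `choose_successor`, and the 0.8 limit, are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class AgentStatus:
    node_id: int
    cpu_load: float      # fraction of CPU in use, 0.0 to 1.0
    memory_usage: float  # fraction of memory in use, 0.0 to 1.0


def choose_successor(statuses: Iterable[AgentStatus], managed_agent: int,
                     limit: float = 0.8) -> Optional[int]:
    """Pick the least-loaded agent other than `managed_agent`, or None.

    Assumed rule: an agent is eligible only if both its CPU load and its
    memory usage are at or below `limit`; ties go to the lower node id.
    """
    eligible = [
        s for s in statuses
        if s.node_id != managed_agent
        and s.cpu_load <= limit
        and s.memory_usage <= limit
    ]
    if not eligible:
        return None  # corresponds to "no applicable node"
    best = min(eligible, key=lambda s: (s.cpu_load + s.memory_usage, s.node_id))
    return best.node_id
```

The managed agent is excluded because a node cannot act as its own manager. A real system would likely weigh further factors.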

The timing of this determination may be when monitoring information is collected; it may be when the load on the third node 403 as manager is low and monitoring is easy to perform; or it may be when the state has changed substantially, for example when an agent's CPU load rises sharply. Other timings are of course also possible. When the management node determination unit 211 makes the determination can be varied according to the requirements of the system. For example, if the determination is made whenever monitoring information is collected, the latest monitoring information is available, so the management node information is always kept up to date. If, on the other hand, the determination is made when the manager's load is low, the effect on the normal processing of the third node 403 is kept to a minimum.

Returning to FIG. 8, the description continues. If the third node 403 determines that a node exists that is the optimal manager for the first node 401 (step S502: Y), it notifies the corresponding agent, the first node 401, of that node (step S503). It then waits until the timing for the next determination arrives (step S504: N), and when the timing described above arrives (Y), it returns to step S501 (return). In this way, the node chosen as the optimal manager can be changed as needed in response to changes in the environment.

If it is determined in step S502 that no node qualifies as the optimal manager (N), an error indicating that no applicable node exists is sent to the corresponding agent, the first node 401 (step S505). In this case as well, processing proceeds to step S504, and the determination of step S501 is made again at the next timing.
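One pass through steps S501 to S505 can be sketched as follows. This is a minimal illustration, not the implementation in the specification: the callables `determine` and `send` and the dictionary message format are assumptions introduced here, and the waiting of step S504 is left to the caller.

```python
NO_APPLICABLE_NODE = "no applicable node"


def run_determination_cycle(agent_id, determine, send):
    """Run one manager-side pass for `agent_id` (steps S501 to S505).

    determine(agent_id) -> node id or None   (hypothetical callable)
    send(agent_id, message)                  (hypothetical callable)
    """
    candidate = determine(agent_id)                       # S501
    if candidate is not None:                             # S502: Y
        send(agent_id, {"type": "successor", "node": candidate})       # S503
    else:                                                 # S502: N
        send(agent_id, {"type": "error", "reason": NO_APPLICABLE_NODE})  # S505
    return candidate
```

The caller would invoke this function again each time the timing of step S504 arrives.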

FIG. 9 shows the flow of processing on the agent side. The description also refers to FIGS. 5 and 6. In this example, the first node 401, the agent, waits to receive the notification of step S503 or step S505 in FIG. 8 from the third node 403, the manager (step S521). On receiving the notification (Y), it overwrites its own management node storage unit 227 with the content of the notification (step S522). For example, if the third node 403 notifies it that the second node 402 is the optimal manager node, the management node storage unit 227 is overwritten with "second node". If an error notification arrives, the management node storage unit 227 is overwritten with "no applicable node". This happens, for example, when the second node 402 is overloaded at the time it is checked by the third node 403.
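The overwrite behaviour of step S522 can be illustrated with a small class. The class name `ManagementNodeStore`, the dictionary message format, and the initial value are assumptions made for this sketch; the text only states that the latest notification replaces the stored content.

```python
NO_APPLICABLE_NODE = "no applicable node"


class ManagementNodeStore:
    """Holds only the most recent notification (cf. storage unit 227)."""

    def __init__(self):
        # Assumed initial state: no successor has been announced yet.
        self.value = NO_APPLICABLE_NODE

    def apply(self, notification):
        """Overwrite the stored value with the notification content (S522)."""
        if notification["type"] == "successor":
            self.value = notification["node"]
        else:  # error notification from step S505
            self.value = NO_APPLICABLE_NODE
        return self.value
```

Because each notification overwrites the previous one, the agent always holds the manager's latest choice and never a history of choices.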

The first node 401, an agent, may also receive from another manager a request to take over management as a manager (step S523: Y). In this case, the first node 401 uses its own management determination unit 228 to judge whether it can manage the designated node as manager (step S524). This judgment draws on various circumstances, such as a prediction of the node's own future load and whether the node is already the manager of another node.

If this judgment finds that management as manager is possible (step S525: Y), the node notifies the requesting node that it can take over management (step S526). In this case management begins immediately, so the node changes the relevant entry in its own function sharing table 416. For example, suppose that, in the relationship between the third node 403 and the first node 401, the third node 403 has become unable to act as manager, and the first node 401 asks the second node 402 to take over as manager. If the second node 402 accepts the management request, it rewrites the corresponding "managing node" entry in its own function sharing table 416 to "second node" (step S527).

FIG. 10 shows the function sharing table after it has been rewritten from the state of FIG. 7. As a comparison with FIG. 7 shows, the "managing node" for the first node 401 has changed from the third node 403 to the second node 402. As a result, the second node 402 becomes a manager that manages both the first and third nodes 401 and 403. Of course, if, for example, the second node 402 were to lose its eligibility as manager with respect to the third node 403, a further change could occur in the future in which the first node 401 instead becomes the manager of the third node 403.

Returning to FIG. 9, the description continues. If it is determined in step S525 that management of the first node 401 is not possible (N), a notification that management is not possible is sent to the requesting node (step S528).
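Steps S524 to S528 can be sketched as a single function on the node that receives a management request. The acceptance rule used here (a predicted-load ceiling and a cap on the number of nodes already managed) is an assumption; the text names these kinds of inputs but not the thresholds. The function name and return values are hypothetical.

```python
def handle_management_request(table, self_id, requester_id,
                              predicted_load, max_load=0.7, max_agents=2):
    """Accept or refuse a request to manage `requester_id` (S524 to S528).

    `table` is this node's own copy of the function sharing table,
    mapping managed node -> managing node. It is updated on acceptance.
    """
    already_managing = sum(1 for mgr in table.values() if mgr == self_id)
    # S524/S525: assumed rule based on predicted load and current duties.
    can_manage = predicted_load <= max_load and already_managing < max_agents
    if can_manage:
        table[requester_id] = self_id  # S527: rewrite the "managing node" entry
        return "accept"                # S526: reply sent to the requester
    return "refuse"                    # S528: reply sent to the requester
```

Starting from the FIG. 7 assignments, an accepted request from the first node 401 to the second node 402 leaves the table in the state described for FIG. 10.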

Meanwhile, in the state shown in FIG. 7, in which the third node 403 manages the first node 401, the first node 401, as agent, uses its manager monitoring unit 221 to monitor whether the third node 403 is functioning as manager. Specifically, when the time to monitor the manager arrives (step S529: Y), the agent checks, for example, whether the manager is alive with respect to itself (step S530). If the manager is confirmed to be alive, for example because it responded to a liveness-check message (step S531: Y), processing ends without further action (return).

If, on the other hand, the manager is not confirmed to be alive (step S531: N), another node must take over as manager. The agent therefore reads the optimal management node candidate from its own management node storage unit 227 (step S532). If no error occurs at this point, such as no manager node being readable (step S533: N), the agent sends a management request to the node it read (step S534). In the earlier example, if the third node 403 fails, data designating the second node 402 is read from the management node storage unit 227, and the management request is sent to that node.

If, on the other hand, an error occurs in step S533 (Y), the node to act as manager cannot be determined. In this case, a general management node search process is executed (step S535). In this working example, to simplify the description, the computer management system 400 consists only of the first to third nodes 401 to 403; if the system consists of more nodes, the remaining nodes are also checked to see whether they can become the managing node. Likewise, if the node to which the management request was sent in step S534 (the second node 402 in this example) replies that it cannot take over management (step S536: Y), processing proceeds to step S535 and the general management node search process is executed.
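The agent-side failover path of steps S530 to S536 can be summarized in a short function. This is a sketch under stated assumptions: the three callables and the use of `None` for "no applicable node" are hypothetical, and the general search of step S535 is treated as an opaque fallback because the text does not detail it.

```python
def failover_once(manager_alive, stored_candidate, request_management,
                  general_search):
    """Run one monitoring pass on the agent side (steps S530 to S536).

    manager_alive() -> bool            liveness check of the current manager
    stored_candidate                   node read from storage, or None
    request_management(node) -> bool   True if the node consents
    general_search() -> node           fallback search (step S535)

    Returns None if the current manager is alive, otherwise the node that
    takes over as manager.
    """
    if manager_alive():                       # S530, S531: Y
        return None                           # nothing to do
    if stored_candidate is None:              # S532, S533: Y (read error)
        return general_search()               # S535
    if request_management(stored_candidate):  # S534, request accepted
        return stored_candidate
    return general_search()                   # S536: Y, then S535
```

The pre-stored candidate is what lets the common case finish with a single request, while the search of step S535 remains as a safety net.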

As described above, according to this working example, the node acting as manager selects the device that will replace it as the next manager node. Compared with having each agent select a manager, this removes the need for coordination to settle on one manager among several candidates. In addition, because the agent-side nodes are relieved of the work of selecting a manager, the load on those nodes and on the network can be reduced.

In the working example, a liveness-check message is sent to the manager. Naturally, the agent may instead measure the delay in the manager's response time or receive reports on the manager's load, so that another agent takes over as manager even while the current manager is still alive.

Also, when the node to act as manager is chosen from among several agents, past records, such as how often each agent has served as manager, may be stored as data and consulted.

Furthermore, the working example was described on the assumption that every node has both the manager function and the agent function in advance, as shown in FIG. 6, but particular nodes may be configured with only one of these functions.

Furthermore, in the working example the manager names a single optimal agent as the candidate to become the new manager, but it may instead notify two or more candidates in priority order. In this case, the request to become manager goes first to the agent with the highest priority, and if consent is not obtained, the request is made to the next candidate in order. This reduces the risk that arises when a management request is rejected.
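The prioritized variant amounts to trying candidates in order until one consents. A minimal sketch, assuming the candidates arrive as a list sorted from highest to lowest priority and that `request_management` is a hypothetical callable returning `True` on consent:

```python
def request_in_priority_order(candidates, request_management):
    """Ask each candidate in order; return the first that consents, else None."""
    for node in candidates:
        if request_management(node):
            return node
    return None  # every candidate refused; a general search would follow
```

With a single-element list this reduces to the single-candidate behaviour described for the working example.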

FIG. 1 is a diagram showing the principle of the computer management system of the present embodiment.
FIG. 2 is a block diagram showing the specific configuration of the computer management system of the present embodiment.
FIG. 3 is a flowchart showing the manager's advance preparation for handling a failure in the present embodiment.
FIG. 4 is a flowchart showing the operation of an agent that asks another agent to take over management when a failure occurs in the present embodiment.
FIG. 5 is a system configuration diagram outlining the configuration of a computer management system in one working example of the present invention.
FIG. 6 is a block diagram showing the configuration common to the nodes of this working example.
FIG. 7 is an explanatory diagram showing the contents of the function sharing table in the initial state before a failure occurs.
FIG. 8 is a flowchart showing, for this working example, the processing of a node that acts as the manager, the managing side, in its relationship with a partner node.
FIG. 9 is a flowchart showing the flow of processing on the agent side in this working example.
FIG. 10 is an explanatory diagram showing the function sharing table after it has been rewritten from the state of FIG. 7.

Explanation of symbols

100, 200, 400 Computer management system
101, 201 Manager
102₁, 202₁ First agent
102₂, 202₂ Second agent
103, 203, 404 Network
111 Failure response determination unit
112 Failure response notification unit
211 Management node determination unit
212 Management node notification unit
221 Manager monitoring unit
225 Management request unit
227 Management node storage unit
228 Management determination unit
401 First node
402 Second node
403 Third node
411 Computer
412 Manager unit
413 Agent unit
414 Function switching unit
416 Function sharing table

Claims

1. A computer management system comprising: an arbitrary number of managed computers; and
a managing computer that is connected to these managed computers via a network and has failure response determining means for determining, at a stage before a failure occurs in its management of the managed computers, the content of the failure response processing that the managed computers are to execute when such a failure occurs.

2. The computer management system according to claim 1, wherein the managing computer includes notification means for notifying the managed computers of the determination result of the failure response determining means.

3. The computer management system according to claim 2, wherein the failure response determining means of the managing computer repeats, at intervals from the time its own device becomes the managing computer, the process of determining a computer that can replace its own device, and the notification means notifies the corresponding managed computer of the determination result each time the failure response determining means makes a determination.

4. The computer management system described in any one of the above, wherein the failure response determining means selects, from among the managed computers, a computer that can replace the managing computer.

5. The computer management system according to claim 1, wherein the replacement computer determined by the failure response determining means receives a request from a managed computer when a failure occurs in the management of the managed computers, and the transition takes place when that computer consents to its own device becoming the managing computer.

6. The computer management system according to claim 5, wherein the replacement computers determined by the failure response determining means are a plurality of prioritized computers, and consent to become the managing computer is sought starting from the computer with the highest priority.

7. A computer management method comprising a failure response determining step of communicating via a network with an arbitrary number of computers placed under the management of the own device and determining the content of the processing to be executed on each of these computers when a failure occurs.

8. The computer management method, further comprising a notification step of notifying, via the network, the content of the processing that the failure response determining step has determined is to be executed when a failure occurs in the managing computer.

9. The computer management method according to claim 8, wherein, in the determining step, the managing computer repeats at intervals the process of determining the content of the failure response processing, and, in the notification step, the determination result is notified to the corresponding managed computer each time the determination is made.

10. A computer management control program that causes a computer, which is connected to an arbitrary number of managed computers via a network and manages those computers, to execute:
a determination process of successively determining the content of the processing that the managed computers are to execute when a failure occurs in the own device; and
a notification process of notifying the managed computers of the determination result of the determination process.