CN100334557C

CN100334557C - Method for selecting intermediate proxy node of cluster network

Info

Publication number: CN100334557C
Application number: CNB021421641A
Authority: CN
Inventors: 程菊生; 吴雪丽; 胡毅; 金正操; 顾光导
Original assignee: Lenovo Beijing Ltd
Current assignee: Lenovo Beijing Ltd
Priority date: 2002-06-10
Filing date: 2002-08-27
Publication date: 2007-08-29
Anticipated expiration: 2022-08-27
Also published as: CN1466055A

Abstract

The present invention relates to a cluster network communication method that introduces intermediate proxy nodes. When the system starts, it automatically selects the intermediate proxy nodes; during operation, when an intermediate proxy node fails and can no longer perform its specified functions, the system selects a new intermediate proxy node.

Description

Method for selecting an intermediate proxy node of a cluster network

Technical field

The present invention relates to communication methods for computer cluster networks, and in particular to a method for selecting an intermediate proxy node for cluster network communication.

Background technology

Various abnormal conditions may occur while a computer is running. The computer administrator needs to know its running state at all times, learn of abnormal conditions promptly, and handle them accordingly, so as to guarantee the safe and stable operation of the computer system.

A computer cluster system is made up of multiple servers (node computers) that are joined by a dedicated high-speed network to form a single super server. In practice, the safe and stable operation of the cluster system is particularly important. It is therefore necessary to monitor the running state of the hardware and software of every node computer in the cluster system, so that problems are found and faults eliminated at any time; moreover, people would prefer to monitor the whole cluster system as a single system image. This calls for a monitoring system that can monitor the whole cluster system.

A communication scheme commonly used by monitoring systems at present is as follows: the monitoring host communicates directly with each node computer to obtain monitoring information (as shown in Fig. 1). First, an agent program running on each node computer 2 obtains the running-state information of that node computer; the information is then sent directly to the monitoring host 1, which thereby monitors each server.

The existing communication method has several obvious defects:

First, the existing scheme is suitable only when the number of nodes is small; once the number of nodes grows beyond a certain point, direct communication can no longer meet the requirements. For example, if the monitored cluster has 256 nodes and TCP (Transmission Control Protocol) is used as the underlying communication protocol, the monitoring host must maintain 256 TCP connections, which consumes a large amount of system resources and may not be achievable at all. If UDP (User Datagram Protocol) is used as the underlying communication protocol, the monitoring host may receive a large number of UDP packets at the same moment; if it fails to process them in time, packets are likely to be dropped, that is, monitoring information is lost. No good solution to this situation exists at present.
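The UDP case can be made concrete with a small model. The following is only an illustrative sketch, not part of the patent: it assumes that every node sends exactly one datagram, that all datagrams arrive before the monitoring host reads any of them, and that the receive buffer holds a fixed number of datagrams. The function name `deliver_burst` and the capacity of 100 are hypothetical.

```python
def deliver_burst(num_datagrams, buffer_capacity):
    """Return (queued, dropped) for a burst that arrives before anything is read.

    Simplified model: the receive buffer holds at most `buffer_capacity`
    datagrams, and every datagram beyond that is silently discarded.
    """
    if num_datagrams < 0 or buffer_capacity < 0:
        raise ValueError("counts must be non-negative")
    queued = min(num_datagrams, buffer_capacity)
    dropped = num_datagrams - queued
    return queued, dropped


# 256 nodes reply at once; the capacity of 100 is an assumed figure.
queued, dropped = deliver_burst(num_datagrams=256, buffer_capacity=100)
print(f"queued={queued} dropped={dropped}")
```

Under these assumptions, every dropped datagram corresponds to one node whose monitoring information never reaches the host in that round.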

Second, the monitoring system is an important component of the cluster system and runs as a background service; it must not take up so many system resources that it affects the other applications of the cluster system. Under the existing communication scheme, however, the monitoring system consumes a large amount of system resources and thereby disturbs the normal operation of the cluster system. A new implementation is therefore needed that occupies as few system resources as possible, so that the overhead of the monitoring system within the whole cluster system is minimized.

Third, the communication scheme of existing monitoring systems cannot properly guarantee that data collection on the nodes is synchronized; that is, it cannot collect the running state of every node computer at the same moment. As a result, the overall running state of the cluster system cannot be understood objectively and accurately, and the single-system-image character of the whole cluster system cannot be reflected.

In view of the above defects of the existing monitoring system communication scheme, a new technical solution is urgently needed that is suitable for large cluster systems with many nodes and can monitor the running state of every node computer synchronously without occupying excessive system resources. The new solution should also guarantee that the monitoring system itself runs safely and stably.

Summary of the invention

An object of the present invention is to provide a new communication scheme for a cluster network system.

Another object of the present invention is to provide a method for selecting an intermediate proxy node for computer cluster monitoring.

A further object of the present invention is to provide a method for automatic replacement when an intermediate proxy node fails.

A still further object of the present invention is to provide a monitoring network that can run safely and stably.

The present invention is a method for selecting an intermediate proxy node of a cluster monitoring network. The method comprises: dividing all node computers of the monitored cluster into several groups; running a node acquisition module on each node computer, responsible for collecting that node's data; running a proxy module on each node computer; allowing the intermediate proxy module to run in either of two states; performing the initial setup of the intermediate proxy node when the system starts; and dynamically replacing the intermediate proxy module if it fails while the system is running.

Description of drawings

Fig. 1 shows the structure of an existing monitoring network.

Fig. 2 shows the communication structure of the hierarchical monitoring network according to the present invention.

Fig. 3 shows the communication structure of the hierarchical monitoring network according to the present invention after the intermediate proxy node selection method is introduced.

Fig. 4 shows the NP configuration process at system startup according to the present invention.

Fig. 5 shows the NP replacement process during system operation according to the present invention.

Embodiment

To achieve the objects of the present invention, the following method can be adopted. As shown in Fig. 2, a basic service module (BSP) 11 runs on the monitoring host, and all node computers are divided into several groups 12. Within each group, a node acquisition module (NA) 13 runs on every node computer, and a node computer in each group additionally runs a node proxy module (NP) 14. Both the NA module and the NP module are software programs that run on the operating system of the node computer.
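The effect of grouping on the monitoring host can be sketched as follows. The description does not say how nodes are assigned to groups, so the consecutive split, the group size of 16, and the names `partition_into_groups`, `direct_peers` and `hierarchical_peers` are all assumptions made purely for illustration.

```python
def partition_into_groups(node_ids, group_size):
    """Split node IDs into consecutive groups of at most `group_size` nodes."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    return [node_ids[i:i + group_size]
            for i in range(0, len(node_ids), group_size)]


# The 256-node cluster from the background section, split into assumed groups of 16.
nodes = [f"node{i:03d}" for i in range(256)]
groups = partition_into_groups(nodes, 16)

direct_peers = len(nodes)         # peers of the monitoring host without grouping
hierarchical_peers = len(groups)  # one proxy per group when grouping is used
print(direct_peers, hierarchical_peers)
```

With these assumed figures, the monitoring host exchanges data with one proxy per group rather than with every node.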

The BSP is responsible for issuing a data collection command whenever the running state of the cluster system is needed, then waiting for and receiving the data returned by the node computers, and summarizing and analyzing it. The NP is responsible for forwarding the collection command received from the BSP to the NA modules of all node computers in its group, then waiting for and receiving the data returned by the NA modules and sending it to the BSP in one batch after aggregation. The NA is responsible for periodically collecting the running-state data of its own node computer and, on receiving a collection command, immediately returning the most recently collected data.

In one information collection pass, the BSP sends the collection command to all NPs by UDP broadcast; on receiving it, each NP in turn sends the command by UDP broadcast to all NAs in its group. The NA running on each node computer periodically collects the running-state data of that node computer; when it receives the collection command from the NP of its group, it passes the data to that NP, and the NP passes the collected data in one batch to the BSP running on the monitoring host. The BSP receives the running-state data of all node computers from the NPs, summarizes and analyzes it, and thereby monitors the whole cluster.
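One collection pass can be sketched as an in-memory simulation. This is a hedged illustration only: it replaces the UDP broadcasts with direct method calls, and the class names, the `cpu_percent` field and the report layout are hypothetical, since the patent does not specify any data format.

```python
from dataclasses import dataclass, field


@dataclass
class NodeAcquisition:
    """NA: keeps the most recent sample of its own node."""
    node_id: str
    latest: dict = field(default_factory=dict)

    def sample(self, cpu_percent):
        # Stand-in for the periodic collection of running-state data.
        self.latest = {"node": self.node_id, "cpu_percent": cpu_percent}

    def on_collect(self):
        # On a collection command, return the latest sample immediately.
        return dict(self.latest)


@dataclass
class NodeProxy:
    """NP: relays the command to every NA in its group and batches the replies."""
    group_id: int
    members: list

    def on_collect(self):
        return [na.on_collect() for na in self.members]


class BasicService:
    """BSP: issues the command to every NP and gathers the batched replies."""

    def __init__(self, proxies):
        self.proxies = proxies

    def collect(self):
        return {proxy.group_id: proxy.on_collect() for proxy in self.proxies}


nas = [NodeAcquisition(f"node{i}") for i in range(4)]
for i, na in enumerate(nas):
    na.sample(cpu_percent=10 * i)

bsp = BasicService([NodeProxy(0, nas[:2]), NodeProxy(1, nas[2:])])
report = bsp.collect()
print(report)
```

The BSP here only ever calls the two proxies, which mirrors the point of the hierarchy: the per-node fan-out happens inside each group.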

With this hierarchical strategy, the node proxy module plays a key role in the whole monitoring network: if a node proxy module stops working properly for some unexpected reason, the monitoring host can no longer obtain the running-state data of all nodes in the corresponding group in time.

Two problems therefore remain to be solved: first, how to select the proxy node (the node that runs the node proxy module); and second, how to select a new proxy node if the proxy node itself fails and can no longer perform the proxy function.

The invention lies in allowing the intermediate proxy module NP to run in either of two states: the enabled state (NP^Enable) and the disabled state (NP^Disable).
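The two states can be sketched as a small state holder. This is an assumed illustration, not the patent's implementation: the names `NPState`, `ProxyModule` and `relay` are hypothetical, and the sketch simply shows a proxy module that relays commands only while it is enabled.

```python
from enum import Enum


class NPState(Enum):
    ENABLED = "enabled"
    DISABLED = "disabled"


class ProxyModule:
    """A proxy module that is present on a node but relays only when enabled."""

    def __init__(self, node_id, state=NPState.DISABLED):
        self.node_id = node_id
        self.state = state

    def relay(self, command):
        # A disabled module stays silent; an enabled one passes the command on.
        if self.state is not NPState.ENABLED:
            return None
        return (self.node_id, command)


active = ProxyModule("node0", NPState.ENABLED)
standby = ProxyModule("node1")
print(active.relay("collect"), standby.relay("collect"))
```

Because the standby module is already running, promoting it later is only a change of state, not the start of a new program.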

As shown in Fig. 3, a basic service module (BSP) 11 runs on the monitoring host, and all node computers are divided into m groups 12 with n node computers in each group. Within each group, every node computer runs both a node acquisition module (NA) 13 and a node proxy module NP (either an NP in the enabled state, NP^Enable 21, or an NP in the disabled state, NP^Disable 22). In each group, however, only the NP on one node computer is in the enabled state, i.e. is NP^Enable.

In one information collection pass, the BSP sends the collection command to all NP^Enable modules by UDP broadcast; on receiving it, each NP^Enable in turn sends the command by UDP broadcast to all NAs in its group. The NA running on each node computer periodically collects the running-state data of that node computer; when it receives the collection command from the NP^Enable of its group, it passes the data to that NP^Enable, which passes the collected data in one batch to the BSP running on the monitoring host. The BSP receives the running-state data of all node computers from the NP^Enable modules, summarizes and analyzes it, and thereby monitors the whole cluster.

From the above it can be seen that only an NP in the enabled state (i.e. NP^Enable) actually performs the function of the intermediate proxy node, relaying commands and data between the BSP and the NAs. If the node computer hosting an NP^Enable encounters an unexpected condition that prevents this NP^Enable from working properly (referred to here as NP failure), the monitoring system can no longer monitor the node computers in that NP^Enable's group.

The present invention achieves automatic selection and replacement of the intermediate proxy node by switching the NP between its two running states under different circumstances.

Automatic selection of the intermediate proxy node covers two cases: selection of NP^Enable when the monitoring system starts, and replacement of NP^Enable while the monitoring system is running. The method of selecting and replacing NP^Enable is described in detail below with reference to the drawings:

1. Selection of NP^Enable at system startup.

As shown in Fig. 4, when the monitoring system starts, the two modules NA and NP run on every node computer. All NP modules are in the initial state and send heartbeat messages to the BSP. For each group of nodes, the BSP records the NP whose heartbeat arrives first and takes it as the NP^Enable of that group; it then sends an NP configuration command 31 by broadcast to notify all NPs in the group. The selected NP changes its state to the enabled state, becoming NP^Enable, and sends an NP configuration response 32 to the BSP; every other NP changes its state to the disabled state. The NP^Enable then broadcasts an NP configuration announcement to all NAs in the group, informing them where NP^Enable is located.
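The startup rule (the first heartbeat in each group wins) can be sketched as below. This is a minimal sketch under stated assumptions: heartbeats are modelled as `(group_id, node_id)` pairs in arrival order, and the function names and the string state labels are hypothetical rather than taken from the patent.

```python
def select_initial_proxies(heartbeats):
    """Pick, for every group, the node whose heartbeat arrived first.

    `heartbeats` is an iterable of (group_id, node_id) pairs in arrival order.
    """
    selected = {}
    for group_id, node_id in heartbeats:
        # setdefault keeps the first node seen for a group and ignores later ones.
        selected.setdefault(group_id, node_id)
    return selected


def apply_configuration(group_members, selected_node):
    """State each proxy module takes after the configuration command."""
    return {node: ("enabled" if node == selected_node else "disabled")
            for node in group_members}


arrivals = [(0, "node2"), (1, "node5"), (0, "node1"), (1, "node4")]
chosen = select_initial_proxies(arrivals)
states = apply_configuration(["node1", "node2", "node3"], chosen[0])
print(chosen, states)
```

Taking the first heartbeat needs no ranking or negotiation among the nodes, and it tends to favor a node that came up quickly and is reachable.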

2. Replacement of NP^Enable while the monitoring system is running.

While the monitoring system is running, a failure of the node computer hosting NP^Enable may leave NP^Enable unable to perform its intended function. The system must therefore detect NP failure promptly and select a new NP^Enable.

An NP in the enabled state (i.e. NP^Enable) continuously sends heartbeat messages to the BSP, whereas an NP in the disabled state (i.e. NP^Disable) sends none. The BSP can thus stay in contact with the NP^Enable of every group at all times. As soon as the NP^Enable of a group fails, the BSP learns of it quickly and selects a new NP by the following process.
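Heartbeat-based failure detection on the BSP side can be sketched with a timeout. The patent does not state how the BSP decides that heartbeats have stopped, so the timeout rule, the 3-second value and the `HeartbeatMonitor` class are assumptions. Timestamps are passed in explicitly to keep the sketch deterministic.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat per group and reports groups that went silent."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # group_id -> time of the last heartbeat received

    def record(self, group_id, now):
        self.last_seen[group_id] = now

    def failed_groups(self, now):
        # A group's enabled proxy is presumed failed once its heartbeat is overdue.
        return sorted(group for group, seen in self.last_seen.items()
                      if now - seen > self.timeout_seconds)


monitor = HeartbeatMonitor(timeout_seconds=3.0)  # assumed timeout
monitor.record(0, now=0.0)
monitor.record(1, now=0.0)
monitor.record(0, now=2.0)   # group 0 keeps sending, group 1 falls silent
failed = monitor.failed_groups(now=4.0)
print(failed)
```

Each group reported this way would then go through the selection process described next.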

As shown in Fig. 5, to select a new NP^Enable in a group, the BSP first sends an NP selection command 35 to all NP modules in that group. Each NP module (whether or not it is in the enabled state) sends an NP selection response 36 to the BSP. The BSP records the first NP to respond, takes it as the NP^Enable of the group, and then sends an NP configuration command 31 by broadcast to notify all NPs in the group. The NP module selected as NP^Enable changes its state to the enabled state and sends an NP configuration response 32 to the BSP; every other NP module changes its state to the disabled state (if it was previously enabled) or remains disabled. The new NP^Enable then broadcasts an NP configuration announcement 33 to all NA modules in the group, informing them where NP^Enable is located.
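The replacement step can be sketched from the BSP's point of view. This is an illustrative sketch with hypothetical names: `states` is the BSP's own record of each proxy module's state in one group, and `responses` lists the node IDs in the order their selection responses arrived. A failed node simply never appears in `responses`.

```python
def replace_proxy(states, responses):
    """Choose the first responder as the new enabled proxy for one group.

    Returns (new_proxy, new_states). If nobody responds, the group has no
    usable proxy and the recorded states are returned unchanged.
    """
    responders = [node for node in responses if node in states]
    if not responders:
        return None, dict(states)
    new_proxy = responders[0]
    # Only the chosen module is enabled; all others become or stay disabled.
    new_states = {node: ("enabled" if node == new_proxy else "disabled")
                  for node in states}
    return new_proxy, new_states


# node1 was the enabled proxy and has failed, so it sends no response.
before = {"node1": "enabled", "node2": "disabled", "node3": "disabled"}
new_proxy, after = replace_proxy(before, responses=["node3", "node2"])
print(new_proxy, after)
```

Because every module is asked, whether enabled or not, a previously enabled module that is slow but still alive is demoted cleanly if another node answers first.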

It is readily seen that, with the method of the present invention, the monitoring system can monitor a large cluster system. Moreover, the intermediate proxy node is selected automatically when the monitoring system starts, and when an intermediate proxy node fails during operation and can no longer perform its intended function, a new intermediate proxy node is likewise selected, which guarantees the stable operation of the monitoring system.

Obviously, the internal structure of the various programs can readily be programmed by those skilled in the art in accordance with the present invention, and is not described further here.

Those skilled in the art can make various changes and modifications to the computer cluster communication method and system of the present invention without departing from its spirit and scope. If such changes and modifications fall within the scope of the claims of the present invention and their technical equivalents, the present invention is intended to encompass them as well.

Claims

1. A method for selecting an intermediate proxy node for cluster network communication, the method comprising the following steps:

running a node acquisition module on each node computer, the module being responsible for collecting the data of that node;

running a proxy module on each node computer;

placing each intermediate proxy module in either an enabled state or a disabled state, the node hosting the intermediate proxy module in the enabled state being the current intermediate proxy node, which is responsible for relaying commands and data between the monitoring host and each node computer;

performing initial setup of the intermediate proxy node at system startup, wherein, during the initial setup, the monitoring host sends a setup command to each intermediate proxy module, each intermediate proxy module returns a setup response, and the intermediate proxy module set to the enabled state sends a setup announcement to each node acquisition module in its group;

performing dynamic replacement if the intermediate proxy module fails during system operation, wherein, during the dynamic replacement, the monitoring host sends a selection command to each intermediate proxy module, each intermediate proxy module returns a selection response, the monitoring host sends a setup command to each intermediate proxy module, each intermediate proxy module returns a setup response, and the intermediate proxy module set to the enabled state sends a setup announcement to each node acquisition module in its group.

2. The method for selecting an intermediate proxy node according to claim 1, further comprising the step of dividing the node computers of the cluster into several groups and establishing the intermediate proxy module in each group.