CN103259832A - Cluster resource control method for achieving dynamic load balance, fault diagnosis and failover

Info

Publication number: CN103259832A
Application number: CN2012105661675A
Authority: CN
Inventors: 于海斌; 史海波; 潘福成; 胡国良; 里鹏; 段彬
Original assignee: Shenyang Institute of Automation of CAS
Current assignee: Shenyang Institute of Automation of CAS
Priority date: 2012-12-24
Filing date: 2012-12-24
Publication date: 2013-08-21

Abstract

The invention provides a cluster resource control method for achieving dynamic load balancing, fault diagnosis and failover. The method comprises the following processes: (a) a client interacts with a cluster system; (b) nodes register with and deregister from a cluster resource coordinator, and the coordinator obtains the load and state information of the nodes; (c) after the current cluster resource coordinator fails, the remaining nodes elect a new cluster resource coordinator and establish a new cluster system. The method uses software load balancing: tasks are allocated and adjusted dynamically according to the resource condition of the cluster nodes, and failed nodes are handled promptly. This improves the overall throughput, efficiency and robustness of the service, allows the number of nodes in the server cluster to be adjusted dynamically as needed, improves server utilization, and reduces the cost to enterprises of adopting information technology.

Description

Cluster resource control method for achieving dynamic load balancing, fault diagnosis and failover

Technical field

The present invention relates to the field of distributed computer software systems, and in particular to a cluster resource control method for achieving dynamic load balancing, fault diagnosis and failover.

Background technology

In distributed computer software systems, server load balancing is a widely shared concern in back-end architecture, and it is important to load-balance enterprise-level software and to handle failed services effectively. Commonly used hardware load balancers currently include F5 BIG-IP, Citrix NetScaler, Radware, Cisco CSS and Foundry products. These products are expensive, costing hundreds of thousands or even millions of RMB, which is too much for small and medium-sized enterprises adopting information technology.

When heavily loaded tasks (for example, queries, database access, or computation-intensive requests with long response times) coexist with lightly loaded tasks, the following situation should be avoided: some nodes are overloaded with very long request queues yet keep receiving new requests, while other nodes sit essentially idle. A mechanism is therefore needed that lets the cluster resource coordinator know the load state of each node in real time and adjust quickly according to that load.

Summary of the invention

To address node load imbalance and failed-node recovery in distributed computer software systems, the present invention proposes a cluster resource control method for achieving dynamic load balancing, fault diagnosis and failover. The method takes into account the real-time load and responsiveness of each node. It continuously adjusts the proportion of tasks distributed to each node according to the service type of the task request, the number of active users, the current network bandwidth, and the utilization of current server resources (CPU utilization, remaining physical memory, and the number of available connections in the database connection pool). This prevents an overloaded node from continuing to receive large numbers of requests, and thereby improves the overall throughput and efficiency of the cluster.
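The text lists the indicators that feed the load decision but gives no formula for combining them. As a minimal sketch, assuming a simple weighted sum, a per-node load score might look like the following. The weights, field names and the `max_users` normalisation are all hypothetical, and network bandwidth and service type are left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class NodeMetrics:
    """Resource snapshot for one node (field names are illustrative)."""
    cpu_utilization: float     # 0.0 .. 1.0
    free_memory_ratio: float   # remaining physical memory / total memory
    free_db_conn_ratio: float  # available pool connections / pool size
    active_users: int
    max_users: int             # assumed capacity, used only for normalisation


# Hypothetical weights; the source does not specify any.
WEIGHTS = {"cpu": 0.4, "memory": 0.3, "db": 0.2, "users": 0.1}


def load_score(m: NodeMetrics) -> float:
    """Return a load value in [0, 1]; higher means more heavily loaded."""
    user_ratio = min(m.active_users / m.max_users, 1.0) if m.max_users else 1.0
    return (
        WEIGHTS["cpu"] * m.cpu_utilization
        + WEIGHTS["memory"] * (1.0 - m.free_memory_ratio)
        + WEIGHTS["db"] * (1.0 - m.free_db_conn_ratio)
        + WEIGHTS["users"] * user_ratio
    )
```

A single scalar score keeps the comparison between nodes trivial; a real deployment would need to tune or replace the weighting.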

The technical solution adopted by the present invention to achieve the above object is a cluster resource control method for achieving dynamic load balancing, fault diagnosis and failover, comprising the following steps:

The cluster resource coordinator is started in the cluster system and the node registration information table is updated; the cluster resource coordinator obtains node load and state information in real time through the heartbeat process;

The client sends a task request to the cluster resource coordinator and interacts with the cluster system, which load-balances the request across the nodes.

Updating the node registration information table comprises the following steps:

The cluster resource coordinator polls in a loop. If a new node joins, the node registers with the cluster resource coordinator and its information is added to the node registration information table. If a node shuts down, the node deregisters from the cluster resource coordinator and its information is deleted from the node registration information table.

The system interaction comprises the following steps:

The client sends a task request to the cluster resource coordinator of the cluster system. After receiving the task request, the cluster resource coordinator checks whether any node is capable of processing it; if none is, the coordinator sends a rejection message to the client. Otherwise the coordinator determines whether node load exceeds a threshold: if the load of every node exceeds the threshold, the task request is added to a waiting queue; otherwise the task request is assigned to the node with the lowest load. For a task request that times out in the waiting queue, the cluster resource coordinator sends a rejection message to the client.

If a node fails, the cluster resource coordinator restarts the failed node and performs failover, comprising the following steps:

If a node fails, that is, the cluster resource coordinator fails to send a heartbeat message to the node, the coordinator restarts the remote process through WMI to recover the failed node. For a node whose recovery fails, the cluster resource coordinator reassigns its tasks to other nodes.

If the cluster resource coordinator fails, a new node is elected as the cluster resource coordinator, starts the service state monitoring process, and establishes a new cluster system.

Whether the cluster resource coordinator or a node has failed is determined as follows:

A node communicates with the cluster resource coordinator through the heartbeat process. If this communication fails, the node checks its communication with the other nodes: if that communication is normal, the cluster resource coordinator has failed; otherwise the node itself has failed.
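The decision rule above can be sketched as a small function run on each node. This is an illustration only: the names are hypothetical, and treating communication with peers as "normal" when at least one peer answers is an assumption, since the text does not say whether one or all peers must respond.

```python
from enum import Enum
from typing import Sequence


class Diagnosis(Enum):
    HEALTHY = "healthy"
    COORDINATOR_FAILED = "coordinator_failed"
    SELF_FAILED = "self_failed"


def diagnose(coordinator_reachable: bool,
             peers_reachable: Sequence[bool]) -> Diagnosis:
    """Classify a heartbeat failure from one node's point of view."""
    if coordinator_reachable:
        return Diagnosis.HEALTHY
    # Assumption: a single responding peer counts as normal communication.
    if any(peers_reachable):
        return Diagnosis.COORDINATOR_FAILED
    # No peer answers either (or there are no peers): suspect this node.
    return Diagnosis.SELF_FAILED
```

With an empty peer list the function returns `SELF_FAILED`, which is the conservative choice for a node that cannot confirm anything about the rest of the cluster.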

The cluster resource coordinator is the central processing node that runs the service state monitoring process and the heartbeat process, and is responsible for monitoring and collecting the load information and state information of each node in the cluster.

The present invention has the following advantages:

1. Tasks can be allocated and adjusted dynamically according to the resource condition of the cluster nodes, and failed nodes are handled promptly, which effectively improves the overall throughput, efficiency and robustness of the service;

2. The number of nodes in the server cluster can be adjusted dynamically according to demand, which improves server utilization;

3. The cost to enterprises of adopting information technology is reduced.

Description of drawings

Fig. 1 is a framework diagram of the cluster resource coordination control system;

Fig. 2 is a flow chart of the interaction between the client and the cluster system;

Fig. 3 is a flow chart of node registration and deregistration;

Fig. 4 is a flow chart of failed-node restart and failover;

Fig. 5 is a flow chart of establishing a new cluster system.

In the figures, 1 is the heartbeat, 2 is the cluster system, 3 is task processing, 4 is the cluster resource coordinator, 5 is a node, 6 is a task request, 7 is the client, 8 is the application end, and 9 is the user.

Embodiment

The present invention is described in further detail below with reference to the drawings and embodiments.

Fig. 1 shows the framework of the cluster resource coordination control system. The software consists of a server cluster system and clients; the clients communicate with the cluster system over a network.

The cluster system is transparent to the client. The client sends a task request to the cluster resource coordinator, the coordinator assigns the task to the most suitable node, and once the task has been processed a response message is returned to the client. The system adopts a service-oriented architecture (SOA). The server side is a server cluster system composed of multiple nodes. Within the cluster, the cluster resource coordinator runs the service state monitoring process and the other nodes run the heartbeat process. The coordinator is responsible for monitoring and collecting the load and state information of every node in the cluster. Each node periodically reports its own load and state information to the coordinator and, through the heartbeat protocol, also obtains a hot copy of the whole cluster's information from the coordinator.

When the coordinator detects that a node has failed, it first restarts the failed node; for a failed node that cannot be restarted, the coordinator transfers the tasks that node was handling to other nodes. When the coordinator itself fails, the nodes communicate through the heartbeat process and elect the node with the lowest load as the new cluster resource coordinator; the elected coordinator starts the service state monitoring process and establishes a new cluster system.

The client sends task requests to the coordinator, which assigns each request to the node with the lowest load; how the request is processed is transparent to the client. The present invention can be applied effectively in SOA software systems with multiple server nodes.

Fig. 2 shows the flow of interaction between the client and the cluster system.

The interaction between the client and the cluster system proceeds as follows:

The client sends a task request to the cluster resource coordinator of the cluster system. After receiving the task request, the coordinator uses the node registration information to determine whether any node is capable of processing it. For a task that cannot be processed, the coordinator returns a rejection message. Otherwise the coordinator determines whether node load exceeds the threshold: if the load of every node exceeds the threshold, the task request is added to the waiting queue; otherwise the task request is assigned to the node with the lowest load. For a task that times out in the queue, a rejection message is returned to the client.
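The dispatch flow just described (reject, queue, or assign to the least-loaded node) can be sketched as follows. This is an illustrative sketch, not the patented implementation: the class names, the `threshold` and `queue_timeout` defaults, and the representation of load as a single float are assumptions.

```python
import time
from collections import deque
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    services: set        # task types this node is able to process
    load: float = 0.0    # current load, 0.0 .. 1.0 (assumed scale)


@dataclass
class Task:
    task_id: str
    service: str
    enqueued_at: float = 0.0


class Dispatcher:
    def __init__(self, nodes, threshold=0.8, queue_timeout=5.0):
        self.nodes = list(nodes)
        self.threshold = threshold          # hypothetical load limit
        self.queue_timeout = queue_timeout  # seconds a task may wait
        self.waiting = deque()

    def submit(self, task, now=None):
        """Return ('rejected', None), ('queued', None) or ('assigned', node)."""
        now = time.monotonic() if now is None else now
        capable = [n for n in self.nodes if task.service in n.services]
        if not capable:
            return ("rejected", None)       # no node can process this task
        best = min(capable, key=lambda n: n.load)
        if best.load > self.threshold:      # every capable node is overloaded
            task.enqueued_at = now
            self.waiting.append(task)
            return ("queued", None)
        return ("assigned", best)

    def expire(self, now=None):
        """Remove and return queued tasks that have waited too long."""
        now = time.monotonic() if now is None else now
        expired = [t for t in self.waiting
                   if now - t.enqueued_at > self.queue_timeout]
        self.waiting = deque(t for t in self.waiting
                             if now - t.enqueued_at <= self.queue_timeout)
        return expired
```

The caller would send a rejection message to the client for every task returned by `expire`, and would retry queued tasks whenever a heartbeat reports that some node's load has dropped below the threshold.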

The cluster resource coordinator is the central processing node of the cluster system. It runs the main program of the cluster resource coordination control method, which comprises the service state monitoring process and the heartbeat process that communicates with the nodes. The heartbeat process, also called the heartbeat protocol, is the process by which the cluster resource coordinator and each node communicate according to a defined protocol. The service state monitoring process periodically exchanges information with the service on each node at a configured interval, monitoring the service state of each node in real time.

Fig. 3 shows the flow of node registration and deregistration.

Nodes register with and deregister from the cluster resource coordinator, and the coordinator obtains node load and state information in real time, as follows:

After starting, the cluster resource coordinator polls in a loop for nodes registering with it or deregistering from it. If a node registers, its information is added to the node registration information table; if a node deregisters, its information is deleted from the table. The coordinator obtains the load and state information of all online nodes in real time through the heartbeat protocol and updates the node registration information table accordingly.
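A minimal in-memory sketch of such a registration table is shown below. The class and field names are hypothetical; the source does not define the table's layout or the content of a heartbeat message.

```python
import time


class NodeRegistry:
    """In-memory node registration table kept by the coordinator."""

    def __init__(self):
        self._nodes = {}

    def register(self, node_id, address):
        # A newly joined node starts with unknown load.
        self._nodes[node_id] = {
            "address": address,
            "load": None,
            "state": "online",
            "last_seen": time.monotonic(),
        }

    def deregister(self, node_id):
        # Removing an unknown node is a no-op.
        self._nodes.pop(node_id, None)

    def update_from_heartbeat(self, node_id, load, state):
        """Apply a heartbeat report; return False for unregistered nodes."""
        entry = self._nodes.get(node_id)
        if entry is None:
            return False
        entry.update(load=load, state=state, last_seen=time.monotonic())
        return True

    def get(self, node_id):
        return self._nodes.get(node_id)

    def online_nodes(self):
        return sorted(self._nodes)
```

Recording `last_seen` is an addition for illustration; it gives the coordinator a simple way to notice nodes whose heartbeats have stopped.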

The node registration information table is a list that stores each node's load, its state information, and the tasks the node is handling. The table is cached in memory, and the node load, state information and task information are also saved in a persistent storage medium such as a database. Caching in memory improves processing speed; saving to persistent storage ensures that a failed node can be recovered, because data held in memory may be lost when a node fails.
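One way to combine the in-memory cache with persistent storage is a write-through table: every update goes to both, and the cache is reloaded from storage on startup. The sketch below uses SQLite from the Python standard library purely as an example; the source only says "a database", and the schema and class name are assumptions.

```python
import json
import sqlite3


class PersistentRegistry:
    """Node table cached in memory and mirrored to SQLite (write-through)."""

    def __init__(self, db_path=":memory:"):
        self._cache = {}
        self._db = sqlite3.connect(db_path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS nodes "
            "(node_id TEXT PRIMARY KEY, info TEXT NOT NULL)"
        )
        # Reload whatever survived a previous run or crash.
        for node_id, info in self._db.execute("SELECT node_id, info FROM nodes"):
            self._cache[node_id] = json.loads(info)

    def put(self, node_id, info):
        """Store load, state and task information for one node."""
        self._cache[node_id] = info
        self._db.execute(
            "INSERT OR REPLACE INTO nodes (node_id, info) VALUES (?, ?)",
            (node_id, json.dumps(info)),
        )
        self._db.commit()

    def get(self, node_id):
        # Reads are served from memory for speed.
        return self._cache.get(node_id)

    def remove(self, node_id):
        self._cache.pop(node_id, None)
        self._db.execute("DELETE FROM nodes WHERE node_id = ?", (node_id,))
        self._db.commit()
```

Passing a file path instead of the default `":memory:"` is what makes the data survive a restart; the default is only convenient for experiments.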

Fig. 4 shows the flow of failed-node restart and failover.

The cluster resource coordinator restarts a failed node and performs failover as follows:

After starting, the cluster resource coordinator sends heartbeat messages to all online nodes according to the node registration information table in order to detect their state. When a node fails, that is, when the coordinator fails to send a heartbeat message to the node, the coordinator restarts the remote process through WMI (Windows Management Instrumentation) to recover the failed node. For a node whose restart fails, the coordinator reassigns the tasks that node was handling to other nodes and notifies those nodes to update their tasks.
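The restart-then-failover sequence can be sketched as below. The remote restart is passed in as a callable rather than implemented, because the actual WMI call is outside the scope of this sketch. Choosing the surviving node with the fewest tasks as the failover target is an assumption; the text only says the tasks go to other nodes.

```python
def handle_node_failure(failed, tasks_by_node, restart):
    """Try to restart `failed`; if that fails, move its tasks elsewhere.

    tasks_by_node: dict mapping node id -> list of task ids it is handling
    restart: callable(node_id) -> bool, True if the node came back
             (a stand-in for the remote restart described in the text)
    Returns a dict task id -> new node id (empty if the restart worked).
    """
    if restart(failed):
        return {}
    survivors = [n for n in tasks_by_node if n != failed]
    if not survivors:
        raise RuntimeError("no surviving node to take over tasks")
    moved = {}
    for task in tasks_by_node.pop(failed, []):
        # Assumption: task count stands in for load when picking a target.
        target = min(survivors, key=lambda n: len(tasks_by_node[n]))
        tasks_by_node[target].append(task)
        moved[task] = target
    return moved
```

The returned mapping is what the coordinator would use to notify the receiving nodes that their task lists have changed.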

Fig. 5 shows the flow of establishing a new cluster system.

When the cluster resource coordinator fails, the other nodes elect a new cluster resource coordinator and establish a new cluster system as follows:

When the cluster resource coordinator fails, none of the online nodes can communicate with it. The nodes then exchange heartbeats with one another (using the hot copy of the node registration information previously obtained from the coordinator) to determine whether the coordinator has in fact failed. If a node cannot exchange heartbeats with the other nodes, the node itself has probably failed; it should exit and wait for the cluster resource coordinator to restart it. If heartbeat communication between the nodes is normal, the coordinator can be judged to have failed; the nodes then communicate again to elect a new cluster resource coordinator, and the elected node starts the service state monitoring process and establishes a new cluster system.
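Elsewhere the embodiment states that the node with the lowest load is elected. Because every node holds the same hot copy of the cluster information, each node can compute the winner locally, as in the sketch below. The function name and the tie-break on node id are assumptions; the source does not describe the election message exchange.

```python
def elect_coordinator(hot_copy, reachable):
    """Pick the new coordinator from the nodes that still answer heartbeats.

    hot_copy: dict node id -> last known load (from the failed coordinator)
    reachable: set of node ids currently responding to peer heartbeats
    Returns the elected node id, or None if no candidate is reachable.
    """
    candidates = {n: load for n, load in hot_copy.items() if n in reachable}
    if not candidates:
        return None
    # Lowest load wins; the node id breaks ties so all nodes agree.
    return min(candidates, key=lambda n: (candidates[n], n))
```

A deterministic tie-break matters here: if two nodes report the same load, every participant must still arrive at the same answer without further negotiation.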

The heartbeat process is the process by which the cluster resource coordinator and each node communicate according to a defined protocol.

The service state monitoring process periodically exchanges information with the service on each node at a configured interval, monitoring the service state of each node in real time.

Claims

1. A cluster resource control method for achieving dynamic load balancing, fault diagnosis and failover, characterized in that it comprises the following steps:

2. The cluster resource control method for achieving dynamic load balancing, fault diagnosis and failover according to claim 1, characterized in that updating the node registration information table comprises the following steps:

3. The cluster resource control method for achieving dynamic load balancing, fault diagnosis and failover according to claim 1, characterized in that the system interaction comprises the following steps:

4. The cluster resource control method for achieving dynamic load balancing, fault diagnosis and failover according to claim 1, characterized in that, if a node fails, the cluster resource coordinator restarts the failed node and performs failover, comprising the following steps:

5. The cluster resource control method for achieving dynamic load balancing, fault diagnosis and failover according to claim 1, characterized in that, if the cluster resource coordinator fails, a new node is elected as the cluster resource coordinator, starts the service state monitoring process, and establishes a new cluster system.

6. The cluster resource control method for achieving dynamic load balancing, fault diagnosis and failover according to claim 1, characterized in that whether the cluster resource coordinator or a node has failed is determined as follows:

7. The cluster resource control method for achieving dynamic load balancing, fault diagnosis and failover according to any one of claims 1 to 6, characterized in that the cluster resource coordinator is the central processing node that runs the service state monitoring process and the heartbeat process, and is responsible for monitoring and collecting the load information and state information of each node in the cluster.