CN103259832A - Cluster resource control method for achieving dynamic load balance, fault diagnosis and failover - Google Patents

Publication number
CN103259832A
CN103259832A CN2012105661675A CN201210566167A CN103259832A CN 103259832 A CN103259832 A CN 103259832A CN 2012105661675 A CN2012105661675 A CN 2012105661675A CN 201210566167 A CN201210566167 A CN 201210566167A CN 103259832 A CN103259832 A CN 103259832A
Authority
CN
China
Prior art keywords
node
cluster resource
coordinator
cluster
control method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2012105661675A
Other languages
Chinese (zh)
Inventor
于海斌
史海波
潘福成
胡国良
里鹏
段彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenyang Institute of Automation of CAS
Original Assignee
Shenyang Institute of Automation of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenyang Institute of Automation of CAS
Priority to CN2012105661675A
Publication of CN103259832A
Current legal status: Pending

Abstract

The invention provides a cluster resource control method for achieving dynamic load balancing, fault diagnosis and failover. The method comprises the following processes: (a) a client interacts with the cluster system; (b) nodes register with and deregister from a cluster resource coordinator, and the coordinator obtains the load and state information of the nodes; (c) after the current cluster resource coordinator fails, the remaining nodes elect a new cluster resource coordinator and establish a new cluster system. The method applies software load balancing: allocation is performed and adjusted dynamically according to the resource condition of the cluster nodes, and failed nodes are handled promptly. This effectively improves the overall throughput, efficiency and robustness of the service, allows the number of nodes in the server cluster to be adjusted dynamically as needed, improves server utilization, and reduces the cost to companies of adopting information technology.

Description

Cluster resource control method for achieving dynamic load balancing, fault diagnosis and failover
Technical field
The present invention relates to the field of distributed computer software systems, and in particular to a cluster resource control method for achieving dynamic load balancing, fault diagnosis and failover.
Background art
In distributed computer software systems, server load balancing is a common concern in back-end architecture, and balancing the load of enterprise-level software while handling failed services effectively is of great importance. Commonly used hardware load balancers at present include F5 BIG-IP, Citrix NetScaler, Radware, Cisco CSS and Foundry products, but these products are expensive, costing hundreds of thousands or even millions of RMB, which is too high a price for small and medium-sized enterprises adopting information technology.
When heavily loaded tasks (for example, queries, database access, or computation-intensive services with long response times) coexist with lightly loaded tasks, the following situation should be avoided: some overloaded nodes have very long request queues yet keep receiving new requests, while other nodes are essentially idle. A mechanism is therefore needed that lets the cluster resource coordinator know the load condition of each node in real time and adjust quickly according to the load.
Summary of the invention
To address node load imbalance and failed-node recovery in distributed computer software systems, the present invention proposes a cluster resource control method for achieving dynamic load balancing, fault diagnosis and failover. The method considers the real-time load and response capability of each node, and continuously adjusts the proportion of task distribution according to the service type of the task request, the number of active users, the current network bandwidth, and the utilization of current server resources (CPU utilization, remaining physical memory, and the number of available connections in the database connection pool). This prevents an overloaded node from continuing to receive a large number of requests, thereby improving the overall throughput and efficiency of the cluster.
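The text names the resource indicators (CPU utilization, remaining physical memory, available database connections) but does not give a formula for combining them. The following is a minimal sketch of one possible combination, assuming a weighted sum; the class name `NodeMetrics`, the function `load_score` and the weights are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass


@dataclass
class NodeMetrics:
    """Resource snapshot reported by a node (field names are illustrative)."""
    cpu_utilization: float       # fraction in the range 0.0..1.0
    free_memory_mb: float        # remaining physical memory
    total_memory_mb: float
    free_db_connections: int     # available connections in the pool
    max_db_connections: int


def load_score(m: NodeMetrics, w_cpu: float = 0.5,
               w_mem: float = 0.3, w_db: float = 0.2) -> float:
    """Combine the indicators into one score in 0.0..1.0 (higher = busier).

    The weighted sum and the default weights are assumptions made for
    illustration; the source does not specify how indicators are combined.
    """
    mem_used = 1.0 - m.free_memory_mb / m.total_memory_mb
    db_used = 1.0 - m.free_db_connections / m.max_db_connections
    return w_cpu * m.cpu_utilization + w_mem * mem_used + w_db * db_used
```

A coordinator could compare such scores across nodes to decide which node receives the next request; the other factors mentioned in the text (service type, active users, network bandwidth) could be added as further weighted terms.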
To achieve the above object, the technical solution adopted by the present invention is a cluster resource control method for achieving dynamic load balancing, fault diagnosis and failover, comprising the following steps:
The cluster resource coordinator is started in the cluster system and updates the node registration information table; the cluster resource coordinator obtains node load and state information in real time through the heartbeat process;
The client sends a task request to the cluster resource coordinator and carries out load-balanced system interaction with the cluster system.
Updating the node registration information table comprises the following steps:
The cluster resource coordinator polls in a loop. If a new node joins, the node registers with the cluster resource coordinator and its information is added to the node registration information table; if a node shuts down, the node deregisters from the cluster resource coordinator and its information is deleted from the node registration information table.
The system interaction comprises the following steps:
The client sends a task request to the cluster resource coordinator of the cluster system. After receiving the task request, the cluster resource coordinator forwards it to a node capable of processing it; if no such node exists, the cluster resource coordinator sends a rejection message to the client. The coordinator then determines whether node load exceeds a threshold: if the load of every node exceeds the threshold, the task request is added to a waiting queue; otherwise the least-loaded node is assigned to process the task request. For task requests that time out in the waiting queue, the cluster resource coordinator sends a rejection message to the client.
If a node fails, the cluster resource coordinator restarts the failed node and performs failover, comprising the following steps:
If a node fails, that is, the cluster resource coordinator fails to deliver a heartbeat message to the node, the coordinator restarts the remote process through WMI to recover the failed node; for a node whose recovery fails, the cluster resource coordinator reassigns its tasks to other nodes.
If the cluster resource coordinator fails, a new node is elected as the cluster resource coordinator, which starts the service state monitoring process and establishes a new cluster system.
The steps for determining whether the cluster resource coordinator or a node has failed are as follows:
A node communicates with the cluster resource coordinator through the heartbeat process. If this communication fails, the node checks its communication state with the other nodes: if that communication is normal, the cluster resource coordinator has failed; otherwise the node itself has failed.
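The diagnosis rule above can be sketched as a small decision function run on a node. This is an illustrative sketch only: the function name `diagnose`, the `ping` callback and the returned labels are hypothetical, and treating "communication with other nodes is normal" as "at least one peer answers" is an assumption, since the text does not say how many peers must respond.

```python
from typing import Callable, Iterable


def diagnose(coordinator_reachable: bool,
             peers: Iterable[str],
             ping: Callable[[str], bool]) -> str:
    """Decide, from one node's point of view, which side has failed.

    ping(peer) is a hypothetical callback that returns True when a
    heartbeat exchange with that peer succeeds.
    """
    if coordinator_reachable:
        return "healthy"
    # The coordinator did not answer: check whether peers still do.
    if any(ping(peer) for peer in peers):
        return "coordinator_failed"
    # Neither the coordinator nor any peer answers: assume this node failed.
    return "self_failed"
```

With an empty peer list the function reports `self_failed`, because a node with no reachable peers cannot tell the two cases apart.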
The cluster resource coordinator is the central processing node that runs the service state monitoring process and the heartbeat process, and is responsible for monitoring and collecting the load information and state information of each node in the cluster.
The present invention has the following advantages:
1. Allocation can be performed and adjusted dynamically according to the resource condition of the cluster nodes, and failed nodes are handled promptly, effectively improving the overall throughput, efficiency and robustness of the service;
2. The number of nodes in the server cluster can be adjusted dynamically according to demand, improving server utilization;
3. The cost to enterprises of adopting information technology is reduced.
Description of drawings
Fig. 1 is a framework diagram of the cluster resource coordination control system;
Fig. 2 is a flow chart of the interaction between the client and the cluster system;
Fig. 3 is a flow chart of node registration and deregistration;
Fig. 4 is a flow chart of restarting a failed node and transferring its tasks;
Fig. 5 is a flow chart of establishing a new cluster system.
In the figures, 1 is the heartbeat, 2 is the cluster system, 3 is task processing, 4 is the cluster resource coordinator, 5 is a node, 6 is a task request, 7 is the client, 8 is the application end, and 9 is the user.
Embodiment
The present invention is described in further detail below with reference to the drawings and embodiments.
Fig. 1 shows the framework of the cluster resource coordination control system. The software consists of a server cluster system and clients; a client communicates with the cluster system over the network.
The cluster system is transparent to the client. The client sends a task request to the cluster resource coordinator, the coordinator assigns the task to the most suitable node, and after the task has been processed the response is returned to the client. The system adopts a service-oriented architecture (SOA). The server side is a server cluster system composed of multiple nodes; within the cluster, the service state monitoring process runs on the cluster resource coordinator and the heartbeat process runs on the other nodes. The cluster resource coordinator is responsible for monitoring and collecting the load information and state information of each node in the cluster. Each node is responsible for periodically reporting its own load and state information to the coordinator and, through the heartbeat protocol, also obtains a hot copy of the whole cluster's information from the coordinator. When the coordinator detects that a node has failed, it first restarts the failed node; for a failed node that cannot be restarted, the coordinator transfers the tasks the node was carrying to other nodes. When the coordinator itself fails, the nodes communicate through the heartbeat process and elect the least-loaded node as the new cluster resource coordinator; the elected coordinator starts the service state monitoring process and establishes a new cluster system. The client sends task requests to the cluster resource coordinator, which assigns the least-loaded node to process each request; the processing of task requests is transparent to the client. The present invention can be applied effectively to SOA software systems with multiple server nodes.
Fig. 2 shows the interaction flow between the client and the cluster system.
The interaction between the client and the cluster system proceeds as follows:
The client sends a task request to the cluster resource coordinator of the cluster system. After receiving the task request, the coordinator determines from the node registration information whether any node is capable of processing it; for a task that cannot be processed, the coordinator returns a rejection message. Otherwise, it determines whether node load exceeds the threshold: if the load of every node exceeds the threshold, the task request is added to the waiting queue; otherwise the least-loaded node is assigned to process the task request. For tasks that time out in the queue, a rejection message is returned to the client.
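The dispatch flow of Fig. 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the class `Dispatcher`, the threshold of 0.8, the 30-second queue timeout and the per-node `services` set used to decide capability are all hypothetical, and the threshold check is applied only to the nodes capable of handling the request.

```python
import time
from collections import deque


class Dispatcher:
    """Coordinator-side dispatch: reject, queue, or assign to least-loaded node."""

    def __init__(self, load_threshold=0.8, queue_timeout_s=30.0,
                 clock=time.monotonic):
        # node_id -> {"load": float, "services": set of service names}
        self.nodes = {}
        self.load_threshold = load_threshold    # assumed value
        self.queue_timeout_s = queue_timeout_s  # assumed value
        self.clock = clock
        self.waiting = deque()  # entries: (enqueue_time, task_id, service)

    def submit(self, task_id, service):
        # Step 1: is any registered node able to process this request?
        capable = {n: info for n, info in self.nodes.items()
                   if service in info["services"]}
        if not capable:
            return ("rejected", None)
        # Step 2: keep only nodes whose load does not exceed the threshold.
        available = {n: info for n, info in capable.items()
                     if info["load"] <= self.load_threshold}
        if not available:
            self.waiting.append((self.clock(), task_id, service))
            return ("queued", None)
        # Step 3: assign the least-loaded node.
        node = min(available, key=lambda n: available[n]["load"])
        return ("assigned", node)

    def expire_waiting(self):
        """Remove and return the ids of queued tasks that have timed out."""
        now = self.clock()
        expired = []
        while self.waiting and now - self.waiting[0][0] > self.queue_timeout_s:
            expired.append(self.waiting.popleft()[1])
        return expired
```

The ids returned by `expire_waiting` are the tasks for which the coordinator would send a rejection message to the client. Because tasks are appended in arrival order, only the head of the queue needs to be checked.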
The cluster resource coordinator is the central processing node of the cluster system. It runs the main program of the cluster resource coordination control method, which comprises the service state monitoring process and the heartbeat process that communicates with the nodes. The heartbeat process, also called the heartbeat protocol, is the process by which the cluster resource coordinator and each node communicate according to a defined protocol. The service state monitoring process periodically sends information to and receives information from each node's service according to a configured period, monitoring the service state of each node in real time.
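One round of the periodic monitoring described here might look like the sketch below. The function `poll_once` and the `send_heartbeat` callback are hypothetical names; the wire format of the heartbeat protocol is not given in the text, so the transport is left to the caller.

```python
from typing import Callable, Dict, Iterable


def poll_once(node_ids: Iterable[str],
              send_heartbeat: Callable[[str], bool]) -> Dict[str, str]:
    """Send one heartbeat to every node and classify each as online or failed.

    send_heartbeat(node_id) is a hypothetical transport callback that
    returns True on a successful exchange and may raise OSError
    (for example ConnectionError or TimeoutError) on network failure.
    """
    status = {}
    for node_id in node_ids:
        try:
            delivered = send_heartbeat(node_id)
        except OSError:
            delivered = False
        status[node_id] = "online" if delivered else "failed"
    return status
```

A monitoring process would call `poll_once` once per configured period and pass any node marked `failed` to the recovery step.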
Fig. 3 shows the node registration and deregistration flow.
The process by which a node registers with and deregisters from the cluster resource coordinator, and by which the coordinator obtains node load and state information in real time, is as follows:
After starting, the cluster resource coordinator begins polling in a loop for nodes registering with or deregistering from it. If a node registers, its information is added to the node registration information table; if a node deregisters, its information is deleted from the table. The coordinator obtains the load and state information of all online nodes in real time through the heartbeat protocol and updates the node registration information table.
The node registration information table is a list that stores each node's load, its state information and the information on the tasks the node is carrying. The table is cached in memory, and the node load, state information and task information are also saved in a persistent storage medium such as a database. Caching in memory improves processing speed; saving to a persistent storage medium ensures that a failed node can be recovered, because the data in memory may be lost when a node fails.
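A registration table that is both cached in memory and written through to persistent storage could be sketched as below. SQLite is used here only as a stand-in for "a database"; the class `NodeRegistry`, the table layout and the JSON encoding are assumptions made for illustration.

```python
import json
import sqlite3


class NodeRegistry:
    """Node registration table: in-memory cache plus write-through persistence."""

    def __init__(self, db_path=":memory:"):
        self.cache = {}  # node_id -> {"load", "state", "tasks"}
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS nodes "
            "(node_id TEXT PRIMARY KEY, info TEXT NOT NULL)"
        )

    def _persist(self, node_id):
        self.db.execute(
            "INSERT OR REPLACE INTO nodes (node_id, info) VALUES (?, ?)",
            (node_id, json.dumps(self.cache[node_id])),
        )
        self.db.commit()

    def register(self, node_id):
        self.cache[node_id] = {"load": 0.0, "state": "online", "tasks": []}
        self._persist(node_id)

    def update_from_heartbeat(self, node_id, load, state, tasks):
        self.cache[node_id] = {"load": load, "state": state,
                               "tasks": list(tasks)}
        self._persist(node_id)

    def deregister(self, node_id):
        self.cache.pop(node_id, None)
        self.db.execute("DELETE FROM nodes WHERE node_id = ?", (node_id,))
        self.db.commit()

    def load_persisted(self, node_id):
        """Read a node's last saved record, e.g. after the cache was lost."""
        row = self.db.execute(
            "SELECT info FROM nodes WHERE node_id = ?", (node_id,)
        ).fetchone()
        return json.loads(row[0]) if row else None
```

Reads during dispatch would use `cache` for speed, while `load_persisted` recovers the task list of a failed node when the in-memory copy is no longer available.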
Fig. 4 shows the flow for restarting a failed node and transferring its tasks.
The process by which the cluster resource coordinator restarts a failed node and performs failover is as follows:
After starting, the cluster resource coordinator begins sending heartbeat messages to all online nodes according to the node registration information table in order to detect node state. When a node fails, that is, when the coordinator fails to deliver a heartbeat message to the node, the coordinator restarts the remote process through WMI (Windows Management Instrumentation) to recover the failed node. For a node that fails to restart, the coordinator reassigns the tasks carried by the failed node to other nodes and notifies those nodes to update their tasks.
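The restart-then-transfer logic can be sketched as below. The remote restart is passed in as a callback so that no WMI call is invented here; the function name `recover_or_fail_over`, the `per_task_load` estimate and the choice of the least-loaded survivor as the target are assumptions, since the text only says that tasks go to "other nodes".

```python
from typing import Callable, Dict, List


def recover_or_fail_over(failed: str,
                         tasks_by_node: Dict[str, List[str]],
                         loads: Dict[str, float],
                         restart: Callable[[str], bool],
                         per_task_load: float = 0.1) -> Dict[str, str]:
    """Try to restart a failed node; if that fails, move its tasks elsewhere.

    restart(node_id) is a hypothetical callback wrapping the remote
    process restart and returning True on success. Returns a mapping
    task -> new node (empty when the restart succeeded).
    """
    if restart(failed):
        return {}
    survivors = {n: load for n, load in loads.items() if n != failed}
    if not survivors:
        raise RuntimeError("no surviving node can take over the tasks")
    moved = {}
    for task in tasks_by_node.pop(failed, []):
        # Assumed policy: give each task to the currently least-loaded node.
        target = min(survivors, key=survivors.get)
        tasks_by_node.setdefault(target, []).append(task)
        survivors[target] += per_task_load  # rough estimate of added load
        moved[task] = target
    return moved
```

The returned mapping is what the coordinator would use to notify the receiving nodes to update their task lists.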
Fig. 5 shows the flow for establishing a new cluster system.
When the cluster resource coordinator fails, the process by which the other nodes elect a new cluster resource coordinator and establish a new cluster system is as follows:
When the cluster resource coordinator fails, none of the online nodes can communicate with it. The nodes then carry out heartbeat communication among themselves (using the hot copy of the node registration information previously obtained from the coordinator) to determine whether the coordinator has failed. If a node cannot carry out heartbeat communication with the other nodes, the node itself may have failed; it should exit and wait for the cluster resource coordinator to restart it. If heartbeat communication among the nodes is normal, it can be concluded that the coordinator has failed; the nodes then communicate to elect a new cluster resource coordinator, and the elected node starts the service state monitoring process and establishes a new cluster system.
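The election step can be sketched as a deterministic choice that every surviving node computes from its hot copy of the registration table, picking the least-loaded node as described earlier in the embodiment. The function name `elect_coordinator` and the tie-break on node id are assumptions; the text does not say how ties are resolved or how nodes agree on the result.

```python
from typing import Dict, Optional, Set


def elect_coordinator(hot_copy: Dict[str, dict],
                      reachable: Set[str]) -> Optional[str]:
    """Pick the least-loaded reachable node as the new coordinator.

    hot_copy maps node_id -> {"load": float, ...} as last received from
    the old coordinator; reachable is the set of nodes that answered the
    peer heartbeat (including the local node). Ties are broken by node id
    (an assumption) so that all nodes reach the same answer.
    """
    candidates = {n: info["load"] for n, info in hot_copy.items()
                  if n in reachable}
    if not candidates:
        return None
    return min(candidates, key=lambda n: (candidates[n], n))
```

If every node holds the same hot copy and sees the same reachable set, each computes the same winner without a further voting round; handling nodes with diverging views would need an additional agreement step that this sketch does not cover.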
The heartbeat process is the process by which the cluster resource coordinator and each node communicate according to a configured protocol.
The service state monitoring process periodically sends information to and receives information from each node's service according to a configured period, monitoring the service state of each node in real time.

Claims (7)

1. A cluster resource control method for achieving dynamic load balancing, fault diagnosis and failover, characterized in that it comprises the following steps:
The cluster resource coordinator is started in the cluster system and updates the node registration information table; the cluster resource coordinator obtains node load and state information in real time through the heartbeat process;
The client sends a task request to the cluster resource coordinator and carries out load-balanced system interaction with the cluster system.
2. The cluster resource control method for achieving dynamic load balancing, fault diagnosis and failover according to claim 1, characterized in that updating the node registration information table comprises the following steps:
The cluster resource coordinator polls in a loop. If a new node joins, the node registers with the cluster resource coordinator and its information is added to the node registration information table; if a node shuts down, the node deregisters from the cluster resource coordinator and its information is deleted from the node registration information table.
3. The cluster resource control method for achieving dynamic load balancing, fault diagnosis and failover according to claim 1, characterized in that the system interaction comprises the following steps:
The client sends a task request to the cluster resource coordinator of the cluster system. After receiving the task request, the cluster resource coordinator forwards it to a node capable of processing it; if no such node exists, the cluster resource coordinator sends a rejection message to the client. The coordinator then determines whether node load exceeds a threshold: if the load of every node exceeds the threshold, the task request is added to a waiting queue; otherwise the least-loaded node is assigned to process the task request. For task requests that time out in the waiting queue, the cluster resource coordinator sends a rejection message to the client.
4. The cluster resource control method for achieving dynamic load balancing, fault diagnosis and failover according to claim 1, characterized in that, if a node fails, the cluster resource coordinator restarts the failed node and performs failover, comprising the following steps:
If a node fails, that is, the cluster resource coordinator fails to deliver a heartbeat message to the node, the coordinator restarts the remote process through WMI to recover the failed node; for a node whose recovery fails, the cluster resource coordinator reassigns its tasks to other nodes.
5. The cluster resource control method for achieving dynamic load balancing, fault diagnosis and failover according to claim 1, characterized in that, if the cluster resource coordinator fails, a new node is elected as the cluster resource coordinator, which starts the service state monitoring process and establishes a new cluster system.
6. The cluster resource control method for achieving dynamic load balancing, fault diagnosis and failover according to claim 1, characterized in that the steps for determining whether the cluster resource coordinator or a node has failed are as follows:
A node communicates with the cluster resource coordinator through the heartbeat process. If this communication fails, the node checks its communication state with the other nodes: if that communication is normal, the cluster resource coordinator has failed; otherwise the node itself has failed.
7. The cluster resource control method for achieving dynamic load balancing, fault diagnosis and failover according to any one of claims 1 to 6, characterized in that the cluster resource coordinator is the central processing node that runs the service state monitoring process and the heartbeat process, and is responsible for monitoring and collecting the load information and state information of each node in the cluster.
CN2012105661675A 2012-12-24 2012-12-24 Cluster resource control method for achieving dynamic load balance, fault diagnosis and failover Pending CN103259832A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2012105661675A CN103259832A (en) 2012-12-24 2012-12-24 Cluster resource control method for achieving dynamic load balance, fault diagnosis and failover

Publications (1)

Publication Number Publication Date
CN103259832A true CN103259832A (en) 2013-08-21

Family

ID=48963528

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2012105661675A Pending CN103259832A (en) 2012-12-24 2012-12-24 Cluster resource control method for achieving dynamic load balance, fault diagnosis and failover

Country Status (1)

Country Link
CN (1) CN103259832A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1659910A (en) * 2002-06-13 2005-08-24 Ut斯达康有限公司 System and method for packet data serving node load balancing and fault tolerance
CN1578320A (en) * 2003-06-30 2005-02-09 微软公司 Network load balancing with main machine status information

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103595572A (en) * 2013-11-27 2014-02-19 牛永伟 Selfreparing method of nodes in cloud computing cluster
CN104837165B (en) * 2014-02-11 2019-09-13 现代自动车株式会社 The method and system of the intiating radio electric network of radio module in vehicle
CN104837165A (en) * 2014-02-11 2015-08-12 现代自动车株式会社 Method and system for initializing radio network for radio modules in vehicle
CN104301241A (en) * 2014-06-05 2015-01-21 中国人民解放军信息工程大学 SOA dynamic load distribution method and system
CN105591790B (en) * 2014-12-30 2019-05-10 中国银联股份有限公司 Data communication connection pool management device
CN105591790A (en) * 2014-12-30 2016-05-18 中国银联股份有限公司 Data communication connection pool management device
CN105808343A (en) * 2014-12-31 2016-07-27 中国科学院沈阳自动化研究所 Cluster resource control method used for complicated production management system
CN105808343B (en) * 2014-12-31 2019-01-04 中国科学院沈阳自动化研究所 For the cluster resource control method in complicated production management system
CN106155770B (en) * 2015-03-30 2019-11-26 联想(北京)有限公司 Method for scheduling task and electronic equipment
CN106155770A (en) * 2015-03-30 2016-11-23 联想(北京)有限公司 Method for scheduling task and electronic equipment
CN106547609A (en) * 2015-09-18 2017-03-29 阿里巴巴集团控股有限公司 A kind of event-handling method and equipment
CN105373431A (en) * 2015-10-29 2016-03-02 武汉联影医疗科技有限公司 Computer system resource management method and computer resource management system
CN105373431B (en) * 2015-10-29 2022-09-27 武汉联影医疗科技有限公司 Computer system resource management method and computer resource management system
CN105357042B (en) * 2015-10-30 2018-09-07 浪潮(北京)电子信息产业有限公司 A kind of highly available cluster system and its host node and from node
CN105357042A (en) * 2015-10-30 2016-02-24 浪潮(北京)电子信息产业有限公司 High-availability cluster system, master node and slave node
CN106850240A (en) * 2015-12-03 2017-06-13 财团法人车辆研究测试中心 Automobile-used distributed network management system and method
CN107153660B (en) * 2016-03-04 2020-03-17 福建天晴数码有限公司 Fault detection processing method and system for distributed database system
CN107153660A (en) * 2016-03-04 2017-09-12 福建天晴数码有限公司 The fault detect processing method and its system of distributed data base system
CN105939389A (en) * 2016-06-29 2016-09-14 乐视控股(北京)有限公司 Load balancing method and device
CN110798339A (en) * 2019-10-09 2020-02-14 国电南瑞科技股份有限公司 Task disaster tolerance method based on distributed task scheduling framework
CN111124806A (en) * 2019-11-25 2020-05-08 山东鲁能软件技术有限公司 Equipment state real-time monitoring method and system based on distributed scheduling task
CN111124806B (en) * 2019-11-25 2023-09-05 山东鲁软数字科技有限公司 Method and system for monitoring equipment state in real time based on distributed scheduling task
CN111427689A (en) * 2020-03-24 2020-07-17 苏州科达科技股份有限公司 Cluster keep-alive method and device and storage medium
CN111427689B (en) * 2020-03-24 2022-06-28 苏州科达科技股份有限公司 Cluster keep-alive method and device and storage medium
CN111694789A (en) * 2020-04-22 2020-09-22 西安电子科技大学 Embedded reconfigurable heterogeneous determination method, system, storage medium and processor
CN112134744A (en) * 2020-10-23 2020-12-25 上海途鸽数据科技有限公司 Management method of nodes in distributed management system
CN112346837A (en) * 2020-10-28 2021-02-09 常州微亿智造科技有限公司 Distributed timer system under industrial Internet of things

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20130821