CN103944784A

CN103944784A - Large-scale-cloud-data-center-oriented server cooperative monitoring method

Info

Publication number: CN103944784A
Application number: CN201410166275.2A
Authority: CN
Inventors: 徐小龙; 杨冠; 章韵; 李嘉豪; 张凯; 李爱群
Original assignee: Nanjing Post and Telecommunication University
Current assignee: Weihai Blue Ocean Bank Co., Ltd
Priority date: 2014-04-23
Filing date: 2014-04-23
Publication date: 2014-07-23
Anticipated expiration: 2034-04-23
Also published as: CN103944784B

Abstract

The invention discloses a server cooperative monitoring method for large-scale cloud data centers. Data servers sense and monitor one another, replacing the centralized monitoring structure; this improves the servers' self-management capability, effectively reduces the load on the monitoring server, and removes the performance bottleneck and the single-point-of-failure risk. The invention provides the cooperative monitoring mechanism and its functional components, and gives the working steps of the mechanism for three cases: a data server joining the system, a single data server failing, and a contiguous group of data servers failing. When the method is used in a large-scale cloud data center, the system response time is markedly shorter than that of a centralized monitoring mechanism, load balancing is achieved effectively, the update overhead is low, and all failed data servers are detected whether they fail in a scattered pattern or as a contiguous group.

Description

A server cooperative monitoring method for large-scale cloud data centers

Technical field

The present invention relates to the management of information technology systems, and in particular to a server cooperative monitoring method for large-scale cloud data centers.

Background technology

Cloud computing, built on centralized cloud data centers, provides users with dynamic, cost-effective, and elastically scalable computing, storage, and other information services. It has changed the architecture and operating model of the traditional information technology industry and has drawn great attention from academia and industry at home and abroad. Governments of major countries and influential enterprises and institutions are building large-scale cloud data centers; Google, Baidu, IBM, Microsoft, Yahoo, Amazon, VMware, Salesforce, Huawei, and others have each proposed their own cloud computing solutions; and widely used network systems such as Facebook, YouTube, Taobao, Wanwang (HiChina), and Sina are all based on cloud computing platforms.

The data servers in a cloud data center are the physical foundation that actually hosts all resources, and their normal operation is a prerequisite for a cloud computing system to provide stable and efficient services. An efficient server monitoring mechanism is therefore essential to a cloud computing system. Current cloud monitoring and management systems focus on monitoring virtual machine resources and behavior, and monitor the servers themselves simply with a centralized architecture combined with heartbeat or polling. For example, the Google cloud computing system uses one or a few master servers to monitor the state of every data server in the server cluster of a cloud data center. IBM's "Blue Cloud" cloud computing platform uses Tivoli monitoring software to monitor the servers of the cloud data center and the execution of tasks, and also adopts a centralized monitoring architecture. Nagios, a host and network status monitoring system widely used in cloud computing systems, likewise adopts a centralized monitoring architecture. The advantages of the centralized architecture are strong controllability and flexible, convenient maintenance; its drawbacks are a performance bottleneck and a single point of failure.

In small and medium-sized data centers, a centralized monitoring architecture is feasible: each data server, acting as a task executor, periodically sends a heartbeat message to the monitoring server to report its current working state, which prevents the delays caused by server failures. In a large-scale cloud data center, however, a simple heartbeat mechanism clearly cannot be used. A huge number of data servers all sending periodic heartbeat messages to the monitoring server would impose a heavy additional network communication load and consume a large share of the monitoring server's system and network resources, leading to a system performance bottleneck and to failure of the monitoring server, with an effect similar to a distributed denial-of-service attack.

To address this problem, the current approach is to deploy monitoring servers with high performance and high availability, supplemented by functional modules such as log recovery or dual-machine backup. This raises system cost and does not solve the problem at its root.

To address the problems of current cloud data center monitoring systems, the present invention provides a server cooperative monitoring method for large-scale cloud data centers. Data servers sense and monitor one another in place of the centralized monitoring architecture, which improves the servers' self-management capability, effectively reduces the load on the monitoring server, and eliminates the performance bottleneck and the risk of monitoring server failure.

Summary of the invention

To solve the above technical problems, the invention provides a server cooperative monitoring method for large-scale cloud data centers. The technical scheme adopted is as follows:

The server cooperative monitoring method for large-scale cloud data centers is implemented on a server cooperative monitoring model whose main components are a monitoring server, a message router, data servers, message queues, a monitoring routing table, and daemon processes. The cooperative monitoring method comprises the following steps:

Step 1: all data servers are connected in sequence to form a unidirectional ring topology. Each data server has a predecessor server and a successor server and is monitored by its successor. When a data server fails, its successor is responsible for reporting the failure to the monitoring server;

Step 2: when a data server joins the system, the cooperative monitoring method is: the unidirectional ring topology is re-established to include the new data server, and the monitoring server notifies the task dispatcher that the new data server has joined the system;

When a single data server fails, the cooperative monitoring method is: the successor of the failed data server detects the failure and reports it to the monitoring server; the unidirectional ring topology is re-established without the failed data server; the monitoring server notifies the task dispatcher of the failure, and monitoring continues;

When a contiguous group of data servers fails, the cooperative monitoring method is: the first normal data server that succeeds the failed group detects the failed data servers one by one and reports each failure to the monitoring server in turn; the unidirectional ring topology is re-established without the failed data servers; the monitoring server notifies the task dispatcher of each failure in turn, and monitoring continues.

In step 2, when a data server joins the system, the cooperative monitoring method is as follows:

Step 1: the system first determines whether the data server is joining for the first time or rejoining. If it is joining for the first time, it connects to the message router through its daemon process and requests the message router to create a dedicated heartbeat queue for it. If N data servers have ever joined the system, the message router holds N heartbeat queues, regardless of whether those servers are currently online. From then on, the server periodically publishes messages to its own heartbeat queue. If the data server is rejoining, the message router already holds the heartbeat queue created when the server first joined, so no new queue needs to be created;

Step 2: the data server reports itself to the monitoring server by publishing a message packet with the topic "login" to the message router; the packet is placed in the global monitoring queue on the message router;

Step 3: the monitoring server subscribes to the global monitoring queue at initialization. When it obtains a "login" packet, it immediately extracts from the packet the information of the server to be monitored (NID, IP, QID) and inserts it into the monitoring routing table that it maintains locally. It then modifies the routing-table record of the new server's successor and sends the relevant information to that successor, so that the unidirectional ring topology is rebuilt.

The modification in step 3 is as follows:

Step (1): suppose data server DN_{m+1} joins the system. Using the information in its "login" packet, the monitoring server appends a record for DN_{m+1} to the end of the monitoring routing table and modifies the record of DN_1, as shown in Table 2. Adding DN_{m+1} means inserting it between DN_1 and DN_m. The original predecessor pointer (PreNode, PreQID) of DN_1, namely (DN_m, Q_m), is therefore written into the predecessor pointer of the DN_{m+1} record, and the predecessor pointer of DN_1 is then changed to (DN_{m+1}, Q_{m+1}).

Step (2): the monitoring server sends the predecessor pointer (DN_m, Q_m) in the DN_{m+1} record to server DN_{m+1}, using the IP address of DN_{m+1}; it also sends the modified predecessor pointer (DN_{m+1}, Q_{m+1}) in the DN_1 record to server DN_1, using the IP address of DN_1;

Step (3): server DN_{m+1} asks the message router to subscribe it to the heartbeat queue identified by Q_m, and server DN_1 asks the message router to subscribe it to the heartbeat queue identified by Q_{m+1}. The network topology is now rebuilt: the original monitoring relation between DN_1 and DN_m has been replaced by a relation between DN_1 and DN_{m+1} and a relation between DN_{m+1} and DN_m.

Step 4: the monitoring server also notifies the task dispatcher that DN_{m+1} has joined the system, so that when new tasks arrive the task dispatcher can choose to assign them to DN_{m+1}.

In step 2, when a single data server fails, the cooperative monitoring method is as follows:

Step 1: if server DN_{i+1} obtains no message from the heartbeat queue identified by Q_i for several consecutive heartbeat periods (for example, 3 consecutive periods), it immediately publishes a message packet with the topic "fault" to the message router; the packet is placed in the global monitoring queue on the message router;

Step 2: when the monitoring server obtains the "fault" packet from the global monitoring queue, it extracts from the packet that the sender is DN_{i+1}, and then looks up its local monitoring routing table to find that the server monitored by DN_{i+1} (that is, the predecessor of DN_{i+1}) is DN_i;

Step 3: the monitoring server concludes that DN_i has failed and updates the monitoring routing table: it first takes the (PreNode, PreQID) information (DN_{i-1}, Q_{i-1}) from the DN_i record and writes it into the (PreNode, PreQID) fields of the DN_{i+1} record, and then deletes the DN_i record;

Step 4: the monitoring server sends the updated (PreNode, PreQID) information to server DN_{i+1}, using the IP address of DN_{i+1}, and DN_{i+1} asks the message router to subscribe it to the heartbeat queue identified by Q_{i-1}. DN_{i+1} and DN_{i-1} have now established a monitoring relation, and the unidirectional ring topology is rebuilt.

Step 5: the monitoring server also notifies the task dispatcher that DN_i has failed. When new tasks arrive, the task dispatcher will not assign them to DN_i unless DN_i recovers and rejoins the system.
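The detection rule in Step 1 can be sketched as a small counter kept by the successor's daemon process. This is a minimal illustration only: the class name `HeartbeatWatcher`, the method `on_period_end`, and the default threshold of 3 (taken from the "3 consecutive periods" example) are assumptions for the sketch, not names from the patent.

```python
class HeartbeatWatcher:
    """Counts consecutive heartbeat periods in which the watched queue was empty.

    Hypothetical helper: one instance per successor, watching its
    predecessor's heartbeat queue.
    """

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold  # consecutive missed periods before reporting
        self.missed = 0

    def on_period_end(self, got_heartbeat: bool) -> bool:
        """Return True only in the period where a "fault" packet should be published."""
        if got_heartbeat:
            self.missed = 0          # any heartbeat resets the count
            return False
        self.missed += 1
        # Report once, exactly when the threshold is reached.
        return self.missed == self.threshold


watcher = HeartbeatWatcher()
history = [True, False, False, True, False, False, False, False]
reports = [watcher.on_period_end(seen) for seen in history]
```

Reporting only when the count reaches the threshold, rather than whenever it is exceeded, keeps the successor from flooding the global monitoring queue with repeated "fault" packets for the same predecessor.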

In step 2, when a contiguous group of data servers fails, the cooperative monitoring method is as follows:

Step 1: when server DN_{i+1} obtains no message from the heartbeat queue identified by Q_i for several consecutive periods, it immediately publishes a message packet with the topic "fault" to the message router; the packet is placed in the global monitoring queue on the message router;

Step 2: when the monitoring server obtains the "fault" packet from the global monitoring queue, it extracts from the packet that the sender is DN_{i+1}, looks up its local monitoring routing table to find that the server monitored by DN_{i+1} is DN_i, and concludes that DN_i has failed;

Step 3: the monitoring server updates the monitoring routing table: it first takes the (PreNode, PreQID) information (DN_{i-1}, Q_{i-1}) from the DN_i record and writes it into the (PreNode, PreQID) fields of the DN_{i+1} record, and then deletes the DN_i record;

Step 4: the monitoring server sends the updated (PreNode, PreQID) information to server DN_{i+1}, using the IP address of DN_{i+1}, and DN_{i+1} asks the message router to subscribe it to the heartbeat queue identified by Q_{i-1}; DN_{i+1} and DN_{i-1} establish a monitoring relation;

Step 6: because DN_{i-1} has also failed, server DN_{i+1} likewise obtains no message from the heartbeat queue identified by Q_{i-1} for several consecutive periods, and again immediately publishes a message packet with the topic "fault" to the message router; the packet is placed in the global monitoring queue on the message router;

Step 7: when the monitoring server obtains the "fault" packet from the global monitoring queue, it extracts from the packet that the sender is DN_{i+1}, looks up its local monitoring routing table to find that the server monitored by DN_{i+1} is now DN_{i-1}, and concludes that DN_{i-1} has failed;

Step 8: the monitoring server updates the monitoring routing table: it first takes the (PreNode, PreQID) information (DN_{i-2}, Q_{i-2}) from the DN_{i-1} record and writes it into the (PreNode, PreQID) fields of the DN_{i+1} record, and then deletes the DN_{i-1} record;

Step 9: the monitoring server sends the updated (PreNode, PreQID) information to server DN_{i+1}, using the IP address of DN_{i+1}, and DN_{i+1} asks the message router to subscribe it to the heartbeat queue identified by Q_{i-2}; DN_{i+1} and DN_{i-2} establish a monitoring relation, and the unidirectional ring topology is rebuilt once more;

Step 10: the monitoring server also notifies the task dispatcher that DN_{i-1} has failed. When new tasks arrive, the task dispatcher will not assign them to DN_{i-1} unless DN_{i-1} recovers and rejoins the system.

When more data servers fail, the same flow as steps 6 to 9 is used to find all failed data servers one by one.

The server cooperative monitoring method for large-scale cloud data centers proposed by the invention achieves the following beneficial effects:

(1) Response time. In a small-scale cloud data center, given the same monitoring interval, the same number of intervals, and the same server failure rate, the response time of the cooperative monitoring mechanism is close to that of the centralized mechanism: each time a server fails, the cooperative mechanism needs one more communication than the centralized mechanism, but the monitoring server in the centralized mechanism has to process more server heartbeat messages. In a large-scale cloud data center, the response time of the cooperative mechanism is markedly shorter than that of the centralized mechanism.

(2) Load balancing. The cooperative mechanism distributes the task of monitoring servers across the data servers themselves. The monitoring server receives information only when a data server joins for the first time, rejoins, or fails. The large volume of normal-state server information that the monitoring server had to process under the original scheme is instead received and processed by the individual data servers, which effectively achieves load balancing.

(3) Update overhead. When a data server joins the system or fails, the network topology of the monitoring system changes, and information on some servers must be updated to rebuild it. Each time a server joins or fails, however, only the monitoring server and the two data servers adjacent to that server are affected, and in substance only the data server responsible for monitoring the joining or failed server is affected. As for the monitoring routing table, only two records need to be modified and the other records are untouched, so the update overhead is very low.

(4) Detection efficiency. Whether data servers fail in a scattered pattern or as a contiguous group, the cooperative mechanism detects all failed servers. Detecting every failed server takes longer when a contiguous group fails, but because the probability of such a group failure is very low, the effect on overall system performance is limited.

Brief description of the drawings

Fig. 1 is the cooperative monitoring network topology diagram.

Fig. 2 is the server cooperative monitoring model diagram.

Embodiment

Even when a cloud computing system uses a cloud data center built from low-cost servers, the proportion of servers that fail within a given period remains low, so reporting "abnormal" clearly requires fewer communications than reporting "normal". The server cooperative monitoring method proposed by the invention follows the principle that "if a data server does not actively send information to the monitoring server, the data server is assumed to be normal": information is sent to the monitoring server only when an abnormality occurs. The difficulty is that a server that has itself become abnormal loses the ability to send abnormality information to the monitoring server. The cooperative monitoring mechanism therefore relies on mutual sensing and mutual supervision among data servers, and a server's abnormality is reported to the monitoring server by its neighbor.
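The communication argument can be made concrete with a rough count of the messages that reach the monitoring server. The sketch below is a simplification under stated assumptions, not a measurement: the function name and parameters are hypothetical, failed servers are assumed to be down for the whole observation window, and heartbeat traffic between data servers and the message router is deliberately left out because it does not land on the monitoring server.

```python
from typing import Dict


def monitor_inbound_messages(num_servers: int, failed: int, cycles: int) -> Dict[str, int]:
    """Count messages arriving at the monitoring server over `cycles` heartbeat periods."""
    # Centralized heartbeat: every live server reports "normal" once per period.
    centralized = (num_servers - failed) * cycles
    # Cooperative scheme: one "fault" packet per failed server, sent by its successor.
    cooperative = failed
    return {"centralized": centralized, "cooperative": cooperative}


counts = monitor_inbound_messages(num_servers=10_000, failed=10, cycles=100)
# counts == {"centralized": 999000, "cooperative": 10}
```

With these illustrative numbers the monitoring server handles 999,000 heartbeats in the centralized case against 10 "fault" packets in the cooperative case, which is the asymmetry the "report abnormal, not normal" principle relies on.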

1. Unidirectional ring topology

The server cluster in the cloud computing system first forms a unidirectional ring topology for cooperative monitoring: the servers are joined end to end, and monitoring between servers is one-directional. As shown in Fig. 1, the predecessor server (PreNode) of DN_1 is DN_8 and the successor server (PostNode) of DN_1 is DN_2. DN_1 is supervised only by DN_2 and is itself responsible only for supervising the state of DN_8. If DN_1 fails, DN_2 is responsible for reporting the failure of DN_1 to the monitoring server MN.
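The predecessor and successor relations in the eight-server example can be sketched with index arithmetic on an ordered list. The helper name `ring_neighbors` and the identifiers `DN1` to `DN8` are illustrative assumptions; the patent itself stores these relations in a routing table rather than computing them.

```python
from typing import List, Tuple


def ring_neighbors(nodes: List[str], node: str) -> Tuple[str, str]:
    """Return (PreNode, PostNode) of `node` in a unidirectional ring."""
    i = nodes.index(node)
    pre_node = nodes[i - 1]                  # index -1 wraps to the last server
    post_node = nodes[(i + 1) % len(nodes)]  # the last server wraps to the first
    return pre_node, post_node


ring = [f"DN{k}" for k in range(1, 9)]       # DN1 .. DN8, joined end to end
pre_node, post_node = ring_neighbors(ring, "DN1")
# DN1 watches DN8 (its PreNode) and is watched by DN2 (its PostNode).
```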

The server cooperative monitoring method for large-scale cloud data centers proposed by the invention covers the following aspects:

(1) A server newly joining the system establishes predecessor and successor relations with the other servers: it determines which server it is responsible for monitoring and which server monitors it, and obtains information such as the IP addresses of its predecessor and successor. For example, DN_1 knows that its predecessor is DN_8 and its successor is DN_2, and not any other server.

(2) When a server fails, the original network topology is broken, and the method rebuilds the complete ring topology quickly and at low overhead. For example, when DN_1 goes down, DN_2 and DN_8 quickly establish a cooperative monitoring relation, and servers with no direct association with DN_1 are unaffected.

(3) When a contiguous group of servers fails almost simultaneously, that is, when a server and the server responsible for monitoring it both fail, the monitoring server still obtains all of the failure information quickly and the badly damaged network topology is rebuilt quickly. For example, when DN_1 and DN_2 fail at the same time, the abnormality of both DN_1 and DN_2 is reported to the monitoring server quickly.

2. Cooperative monitoring model

To handle the situations above, a server cooperative monitoring model suited to large-scale cloud computing systems must first be built. As shown in Fig. 2, it mainly involves the following functional components:

(1) Monitoring server. The monitoring server is responsible for monitoring the running state of all servers in the system. The server state information it obtains can be provided to the cloud system administrator for a quick overview of the whole system; it maintains the cooperative monitoring relations among the servers and thereby safeguards normal operation; and it provides other modules in the system (such as the task dispatcher) with a basis for selecting task executors or service providers. In particular, it sends server failure information to modules such as the task dispatcher in a timely manner.

(2) Message router. The message router serves as the core component for exchanging information among all servers and modules in the system. It uses an advanced message queuing protocol and can maintain one or more message queues.

(3) Message queue. Each message queue is a named buffer on the message router. Every message queue has a unique queue identifier QID, and every message has a topic (Topic). A server can bind to a specific message queue. The cooperative monitoring system involves two kinds of message queue: heartbeat queues and the global monitoring queue. Each server has one heartbeat queue of its own; the server periodically publishes heartbeat messages to that queue, and other servers can subscribe to messages from it. There is only one global monitoring queue in the whole system, and it carries message packets with two topics: "login" packets and "fault" packets. A "login" packet contains the identifier NID, the IP address, and the heartbeat queue identifier QID of the sending server; a "fault" packet contains the identifier NID of the sending server.

(4) Monitoring routing table. The monitoring routing table resides on the monitoring server and is in essence a unidirectional circular linked list. Each record in the table is described by a 5-tuple (NID, IP, QID, PreNode, PreQID), where NID is the server identifier, IP is the server's IP address, QID is the identifier of the server's heartbeat queue, PreNode is the predecessor of the current server, and PreQID is the identifier of the predecessor's heartbeat queue. A typical monitoring routing table is shown in Table 1.

Table 1. Monitoring routing table

NID	IP	QID	PreNode	PreQID
DN_1	198.1.1.1	Q_1	DN_m	Q_m
DN_2	198.1.1.2	Q_2	DN_1	Q_1
……	……	……	……	……
DN_{i-1}	198.1.1.(i-1)	Q_{i-1}	DN_{i-2}	Q_{i-2}
DN_i	198.1.1.i	Q_i	DN_{i-1}	Q_{i-1}
DN_{i+1}	198.1.1.(i+1)	Q_{i+1}	DN_i	Q_i
……	……	……	……	……
DN_m	198.1.1.m	Q_m	DN_{m-1}	Q_{m-1}

(5) Daemon process. A daemon process resides on each server and interacts with the other components on the server's behalf. Within the monitoring system its main job is to publish data to specific message queues and to subscribe to the required messages from specific message queues.
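The message packets and the two kinds of queue described above can be sketched with a toy in-memory router. Everything here is an illustrative assumption rather than the patent's implementation: the class names, the `publish` and `poll` methods, the queue name `Q_global`, and the `"heartbeat"` topic are invented for the sketch, and a real deployment would use an actual message broker.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from typing import Deque, Dict, Optional


@dataclass(frozen=True)
class Message:
    topic: str                   # "login" or "fault"; "heartbeat" is an assumed topic name
    nid: str                     # NID of the sending server
    ip: Optional[str] = None     # carried only by "login" packets
    qid: Optional[str] = None    # sender's heartbeat queue, carried only by "login" packets


class MessageRouter:
    """Toy in-memory stand-in for the message router (hypothetical API)."""

    GLOBAL_QID = "Q_global"      # assumed identifier of the single global monitoring queue

    def __init__(self) -> None:
        self.queues: Dict[str, Deque[Message]] = defaultdict(deque)

    def publish(self, qid: str, msg: Message) -> None:
        self.queues[qid].append(msg)

    def poll(self, qid: str) -> Optional[Message]:
        """Return the oldest message in the queue, or None if it is empty."""
        queue = self.queues[qid]
        return queue.popleft() if queue else None


router = MessageRouter()
# A daemon process announces its server, then starts publishing heartbeats.
router.publish(MessageRouter.GLOBAL_QID, Message("login", "DN1", "198.1.1.1", "Q1"))
router.publish("Q1", Message("heartbeat", "DN1"))
login = router.poll(MessageRouter.GLOBAL_QID)
```

In this sketch an empty result from `poll` on a heartbeat queue is what a successor would count as a missed heartbeat period.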

3. Cooperative monitoring flow when a data server joins the system

The cooperative monitoring flow when a data server joins the system is as follows:

Step 1: the system first determines whether the data server is joining for the first time or rejoining. If it is joining for the first time, it connects to the message router through its daemon process and requests the message router to create a dedicated heartbeat queue for it; the server then periodically publishes heartbeat messages to that queue. If N servers have ever joined the system, the message router holds N heartbeat queues, regardless of whether those servers are currently online. If the data server is rejoining, the message router already holds the heartbeat queue created when the server first joined, so no new queue needs to be created, and the server periodically publishes heartbeat messages to its existing heartbeat queue;

Step 2: the data server reports itself to the monitoring server by publishing a message packet with the topic "login" to the message router; the packet is placed in the global monitoring queue on the message router;

Step 3: the monitoring server subscribes to the global monitoring queue at initialization. When it obtains a "login" packet, it immediately extracts from the packet the information of the server to be monitored (NID, IP, QID) and inserts it into the monitoring routing table that it maintains locally. It then modifies the routing-table record of the new server's successor and sends the relevant information to the corresponding servers, so that the unidirectional ring topology is rebuilt.

The modification is carried out as follows:

Step (1): suppose data server DN_{m+1} joins the system. Using the information in its "login" packet, the monitoring server appends a record for DN_{m+1} to the end of the monitoring routing table and modifies the record of DN_1, as shown in Table 2. Adding DN_{m+1} means inserting it between DN_1 and DN_m. The original predecessor pointer (PreNode, PreQID) of DN_1, that is, the server identifier of DN_m and the corresponding heartbeat queue identifier Q_m, is therefore written into the predecessor pointer of the DN_{m+1} record, and the predecessor pointer of DN_1 is then changed to (DN_{m+1}, Q_{m+1}). (The server identifier of server DN_m is written simply as DN_m; the identifiers of other servers are written the same way below.)

Table 2. Monitoring routing table after adding the DN_{m+1} record

NID	IP	QID	PreNode	PreQID
DN_1	198.1.1.1	Q_1	DN_{m+1}	Q_{m+1}
DN_2	198.1.1.2	Q_2	DN_1	Q_1
……	……	……	……	……
DN_{i-1}	198.1.1.(i-1)	Q_{i-1}	DN_{i-2}	Q_{i-2}
DN_i	198.1.1.i	Q_i	DN_{i-1}	Q_{i-1}
DN_{i+1}	198.1.1.(i+1)	Q_{i+1}	DN_i	Q_i
……	……	……	……	……
DN_m	198.1.1.m	Q_m	DN_{m-1}	Q_{m-1}
DN_{m+1}	198.1.1.(m+1)	Q_{m+1}	DN_m	Q_m
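The two-record update that turns Table 1 into Table 2 can be sketched as follows. This is an illustration under assumptions: the routing table is modeled as a dictionary keyed by NID, the function names `handle_login` and `ring_order` are hypothetical, the table is assumed to be non-empty, and the joining server is assumed not to have a record already.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]  # keys: ip, qid, pre_node, pre_qid


def handle_login(table: Dict[str, Record], head: str,
                 nid: str, ip: str, qid: str) -> List[Tuple[str, str, str]]:
    """Append the newcomer's record and repoint the head record (DN_1) at it.

    Returns the (recipient NID, PreNode, PreQID) notices the monitoring
    server would send so each recipient can subscribe to the right queue.
    """
    old_pre_node = table[head]["pre_node"]
    old_pre_qid = table[head]["pre_qid"]
    # The newcomer inherits the head's old predecessor pointer (DN_m, Q_m).
    table[nid] = {"ip": ip, "qid": qid,
                  "pre_node": old_pre_node, "pre_qid": old_pre_qid}
    # The head now monitors the newcomer.
    table[head]["pre_node"], table[head]["pre_qid"] = nid, qid
    return [(nid, old_pre_node, old_pre_qid), (head, nid, qid)]


def ring_order(table: Dict[str, Record], start: str) -> List[str]:
    """Follow PreNode pointers from `start` until the ring closes."""
    order, current = [start], table[start]["pre_node"]
    while current != start:
        order.append(current)
        current = table[current]["pre_node"]
    return order


# Table 1 with m = 3, then DN4 joins.
table = {
    "DN1": {"ip": "198.1.1.1", "qid": "Q1", "pre_node": "DN3", "pre_qid": "Q3"},
    "DN2": {"ip": "198.1.1.2", "qid": "Q2", "pre_node": "DN1", "pre_qid": "Q1"},
    "DN3": {"ip": "198.1.1.3", "qid": "Q3", "pre_node": "DN2", "pre_qid": "Q2"},
}
notices = handle_login(table, "DN1", "DN4", "198.1.1.4", "Q4")
```

Only the new record and the `DN1` record change; walking the PreNode pointers afterwards still visits every server once, which is the sense in which the ring is rebuilt.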

4. Cooperative monitoring flow when a single data server fails

The cooperative monitoring flow when a single server or a few scattered servers fail (illustrated with server DN_{i+1}) is as follows:

Step 2: when the monitoring server obtains the "fault" packet from the global monitoring queue, it extracts from the packet that the sender is DN_{i+1}, and then looks up its local monitoring routing table to find that the server monitored by DN_{i+1} (its predecessor) is DN_i, as shown in Table 3.

Table 3. Monitoring routing table after DN_i fails

Step 3: the monitoring server concludes that DN_i has failed and updates the monitoring routing table: it first takes the (PreNode, PreQID) information from the DN_i record (that is, the server identifier of DN_{i-1} and the corresponding heartbeat queue identifier Q_{i-1}) and writes it into the (PreNode, PreQID) fields of the DN_{i+1} record, and then deletes the DN_i record;
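Steps 2 and 3 amount to a linked-list deletion driven by the reporter's own record. The sketch below assumes the routing table is a dictionary keyed by NID and that the ring holds at least two servers; the function name `handle_fault` is hypothetical.

```python
from typing import Dict, Tuple

Record = Dict[str, str]  # keys: ip, qid, pre_node, pre_qid


def handle_fault(table: Dict[str, Record], reporter: str) -> Tuple[str, str, str]:
    """Process a "fault" packet sent by `reporter`.

    Returns (failed NID, new PreNode, new PreQID); the last two values are
    what the monitoring server would send back to the reporter.
    """
    failed = table[reporter]["pre_node"]   # the reporter's monitored server
    gone = table.pop(failed)               # delete the failed server's record
    # The reporter inherits the failed server's predecessor pointer.
    table[reporter]["pre_node"] = gone["pre_node"]
    table[reporter]["pre_qid"] = gone["pre_qid"]
    return failed, gone["pre_node"], gone["pre_qid"]


table = {
    "DN1": {"ip": "198.1.1.1", "qid": "Q1", "pre_node": "DN4", "pre_qid": "Q4"},
    "DN2": {"ip": "198.1.1.2", "qid": "Q2", "pre_node": "DN1", "pre_qid": "Q1"},
    "DN3": {"ip": "198.1.1.3", "qid": "Q3", "pre_node": "DN2", "pre_qid": "Q2"},
    "DN4": {"ip": "198.1.1.4", "qid": "Q4", "pre_node": "DN3", "pre_qid": "Q3"},
}
result = handle_fault(table, "DN3")        # DN3 reports that its predecessor is silent
```

Note that the "fault" packet only needs to carry the reporter's NID: the monitoring server derives the failed server from the routing table, which matches the packet format described for the global monitoring queue.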

5. Cooperative monitoring flow when a contiguous group of data servers fails

The probability that a contiguous group of servers fails is very low in current cloud data center environments, but it can still happen and must therefore be considered. As shown in Table 4, suppose servers DN_i and DN_{i-1} fail almost simultaneously. This means that DN_i, which is responsible for monitoring DN_{i-1}, cannot publish a "fault" packet to the message router, so the monitoring server cannot learn of the failure of DN_{i-1} from it. DN_{i+1}, however, has not failed and can still monitor DN_i.

Table 4. Monitoring routing table after servers DN_i and DN_{i-1} fail in succession

NID	IP	QID	PreNode	PreQID
DN_1	198.1.1.1	Q_1	DN_{m+1}	Q_{m+1}
DN_2	198.1.1.2	Q_2	DN_1	Q_1
……	……	……	……	……
DN_{i-2}	198.1.1.(i-2)	Q_{i-2}	DN_{i-3}	Q_{i-3}
DN_{i-1}	198.1.1.(i-1)	Q_{i-1}	DN_{i-2}	Q_{i-2}
DN_i	198.1.1.i	Q_i	DN_{i-1}	Q_{i-1}
DN_{i+1}	198.1.1.(i+1)	Q_{i+1}	DN_i	Q_i
……	……	……	……	……
DN_m	198.1.1.m	Q_m	DN_{m-1}	Q_{m-1}
DN_{m+1}	198.1.1.(m+1)	Q_{m+1}	DN_m	Q_m

In flakes collaborative monitoring flow process when continuous inefficacy of server is as follows:

Step 1: at server DN _i+1continuous several cycle is not from Q _iin the heartbeat queue identifying, obtain message, issue a message bag that themes as " fault " to message router at once, this message bag is by the global monitoring queue being received on message router;

Step 3: monitoring server upgrades monitoring routing table: server DN in first showing _i(PreNode, PreQID) information in corresponding record (is server DN _i-1server identification and corresponding heartbeat queue identity Q _i-1) extract to upgrade DN _i+1(PreNode, PreQID) information in corresponding record, then by DN _icorresponding record is deleted;

Step 4: monitoring server is according to DN _i+1iP address by upgrade after (PreNode, PreQID) information send to again server DN _i+1, server DN _i+1subscribe to and be designated Q to message router application _i-1heartbeat queue, DN _i+1and DN _i-1set up monitoring relation, as shown in table 5;

Table 5 is deleted failed server DN _ithe monitoring routing table of record

Step 5: monitoring server also needs DN _isituation about breaking down and lost efficacy is noticed to task dispatcher, follow-up while having new task again, task dispatcher will be not can allocating task to DN _iunless, DN _irecover normal presence and rejoin system.

Step 6: due to DN _i-1also because fault had lost efficacy, therefore same, at server DN _i+1to continuous several cycles not from Q _i-1in the heartbeat queue identifying, obtain message, issue a message bag that themes as " fault " to message router at once again, this message bag is by the global monitoring queue being received on message router;

Step 8: The monitoring server updates the monitoring routing table: it first extracts the (PreNode, PreQID) information from the record for server DN_{i-1} (that is, the server identifier of DN_{i-2} and its heartbeat queue identifier Q_{i-2}), uses it to update the (PreNode, PreQID) information in the record for DN_{i+1}, and then deletes the record for DN_{i-1};

Step 9: Using the IP address of DN_{i+1}, the monitoring server sends the updated (PreNode, PreQID) information to server DN_{i+1}; DN_{i+1} then applies to the message router to subscribe to the heartbeat queue identified by Q_{i-2}, so that DN_{i+1} and DN_{i-2} establish a monitoring relationship and the unidirectional ring network topology is rebuilt, as shown in Table 6.

Table 6: Monitoring routing table after the record of failed server DN_{i-1} is deleted

Step 10: The monitoring server must also notify the task dispatcher that DN_{i-1} has failed; when new tasks subsequently arrive, the task dispatcher will not assign tasks to DN_{i-1} unless DN_{i-1} returns to a normal state and rejoins the system.

When more data servers fail contiguously, the same flow as steps 6 to 9 is applied repeatedly until all failed data servers have been found.
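The routing-table repair performed by the monitoring server in steps 3 to 5 and 8 to 10 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the names `Record`, `build_ring`, and `handle_fault`, the in-memory dictionary standing in for the monitoring routing table, and the `excluded` set standing in for the notification to the task dispatcher are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One row of the monitoring routing table (field names are illustrative)."""
    ip: str
    qid: str        # heartbeat queue this node publishes to
    pre_node: str   # predecessor node that this node monitors
    pre_qid: str    # heartbeat queue of that predecessor


def build_ring(n):
    """Build a unidirectional ring DN1..DNn; each node watches its predecessor."""
    table = {}
    for i in range(1, n + 1):
        pre = n if i == 1 else i - 1
        table[f"DN{i}"] = Record(f"198.1.1.{i}", f"Q{i}", f"DN{pre}", f"Q{pre}")
    return table


def handle_fault(table, reporter, excluded):
    """Process one "fault" message sent by `reporter`.

    The reporter inherits the failed node's (PreNode, PreQID), the failed
    node's record is deleted, and the failed node is added to `excluded`
    (a stand-in for notifying the task dispatcher). Returns the pair the
    reporter should subscribe to next.
    """
    rec = table[reporter]
    failed_name = rec.pre_node
    failed = table.pop(failed_name)
    rec.pre_node, rec.pre_qid = failed.pre_node, failed.pre_qid
    excluded.add(failed_name)
    return rec.pre_node, rec.pre_qid
```

With six servers of which DN3 and DN4 are down, DN5 reports twice: the first call to `handle_fault(table, "DN5", excluded)` removes DN4 and points DN5 at DN3, and the second removes DN3 and points DN5 at DN2, which restores the ring.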

Claims

1. A server cooperative monitoring method for large-scale cloud data centers, implemented on the basis of a server cooperative monitoring model whose main components comprise a monitoring server, a message router, data servers, message queues, a monitoring routing table, and a daemon process; the cooperative monitoring method comprises the following steps:

2. The server cooperative monitoring method for large-scale cloud data centers according to claim 1, wherein in step 2, when a data server joins the system, the cooperative monitoring method is as follows:

Step 1: The system first determines whether the data server currently joining is joining for the first time or rejoining. If the data server is joining for the first time, it connects to the message router through its daemon process and requests that the message router create an independent heartbeat queue for it; the server then periodically publishes heartbeat messages to that queue. If N data servers have ever joined the system, N heartbeat queues exist on the message router regardless of whether those servers are currently online, and in subsequent stages each server periodically publishes messages to its own heartbeat queue. If the data server is rejoining, the message router already holds the heartbeat queue created when the server first joined, so no new queue needs to be created, and the server periodically publishes heartbeat messages to its own heartbeat queue;
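The distinction between a first join and a rejoin can be illustrated with a small in-memory stand-in for the message router. This is only a sketch: the class `MessageRouter`, its methods, and the `Q_<server id>` naming scheme are assumptions made for illustration and do not correspond to any particular message broker API.

```python
class MessageRouter:
    """In-memory stand-in for the message router (hypothetical, not a real broker API)."""

    def __init__(self):
        # Heartbeat queues persist even while their owning server is offline.
        self.heartbeat_queues = {}

    def join(self, server_id):
        """Return (queue id, created) for a server joining or rejoining."""
        qid = f"Q_{server_id}"          # assumed naming scheme
        created = qid not in self.heartbeat_queues
        if created:
            # First join: create an independent heartbeat queue for the server.
            self.heartbeat_queues[qid] = []
        # Rejoin: the existing queue is reused and nothing is re-created.
        return qid, created

    def publish_heartbeat(self, qid, payload):
        """Append one heartbeat message to the server's own queue."""
        self.heartbeat_queues[qid].append(payload)
```

Calling `join("DN1")` twice returns the same queue identifier, with `created` true only the first time, so the number of queues equals the number of servers that have ever joined.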

3. The server cooperative monitoring method for large-scale cloud data centers according to claim 2, wherein the modification method in step 3 is specifically:

Step (1): Suppose data server DN_{m+1} joins the system. Based on the information in its "login" message packet, the monitoring server appends a new record for DN_{m+1} at the end of the monitoring routing table and at the same time modifies the corresponding information in the DN_1 record, as shown in Table 2. Adding data server DN_{m+1} means inserting DN_{m+1} between DN_1 and DN_m; therefore the original predecessor pointer (PreNode, PreQID) of DN_1, namely (DN_m, Q_m), is written into the predecessor pointer fields of the DN_{m+1} record, and the predecessor pointer of DN_1 is then changed to (DN_{m+1}, Q_{m+1});
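The pointer update in step (1) amounts to inserting a node into a ring just before its head. A minimal sketch follows, assuming the routing table is an insertion-ordered dictionary whose first entry is DN_1. The function name `add_server`, the dictionary keys, and the handling of an empty or single-node table are illustrative assumptions, since the text above covers only the general case.

```python
def add_server(table, name, ip, qid):
    """Append `name` to the ring described by `table`.

    `table` maps node name -> {"ip", "qid", "pre_node", "pre_qid"} and its
    first entry is treated as DN_1 (the head of the ring).
    """
    if not table:
        # Assumption: a lone node has no predecessor to monitor yet.
        table[name] = {"ip": ip, "qid": qid, "pre_node": None, "pre_qid": None}
        return
    head = next(iter(table))
    head_rec = table[head]
    if head_rec["pre_node"] is None:
        # Assumption: in a ring of one, the head and the new node watch each other.
        pre_node, pre_qid = head, head_rec["qid"]
    else:
        # General case: the new node inherits DN_1's old predecessor (DN_m, Q_m).
        pre_node, pre_qid = head_rec["pre_node"], head_rec["pre_qid"]
    table[name] = {"ip": ip, "qid": qid, "pre_node": pre_node, "pre_qid": pre_qid}
    # DN_1 now monitors the newly added node.
    head_rec["pre_node"], head_rec["pre_qid"] = name, qid
```

After adding DN1, DN2, and DN3 in turn, DN3 monitors DN2, DN2 monitors DN1, and DN1 monitors DN3, which is the unidirectional ring the step describes.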

4. The server cooperative monitoring method for large-scale cloud data centers according to claim 1, wherein in step 2, when a single data server fails, the cooperative monitoring method is as follows:

Step 1: If server DN_{i+1} fails to obtain a message from the heartbeat queue identified by Q_i for several consecutive heartbeat cycles, it immediately publishes a message packet with the subject "fault" to the message router; this packet is received by the global monitoring queue on the message router;

Step 2: When the monitoring server obtains the "fault" message packet from the global monitoring queue, it extracts from the packet that the server that sent the fault message is DN_{i+1}, and then looks up the local monitoring routing table to find that the predecessor server monitored by DN_{i+1} is DN_i;

Step 4: Using the IP address of DN_{i+1}, the monitoring server sends the updated (PreNode, PreQID) information to server DN_{i+1}; DN_{i+1} then applies to the message router to subscribe to the heartbeat queue identified by Q_{i-1}, whereupon DN_{i+1} and DN_{i-1} establish a monitoring relationship and the unidirectional ring network topology is rebuilt;
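On the data server side, step 1 and the resubscription in step 4 can be pictured as a counter of consecutive silent cycles. The sketch below is hypothetical: the class `HeartbeatWatcher`, the message fields, and the default threshold of three cycles are assumptions, since the text says only "several consecutive heartbeat cycles".

```python
from typing import Optional


class HeartbeatWatcher:
    """Tracks the predecessor's heartbeat queue from the monitoring node's side."""

    def __init__(self, node: str, pre_qid: str, max_missed: int = 3):
        self.node = node
        self.pre_qid = pre_qid
        self.max_missed = max_missed   # assumed value for "several" cycles
        self.missed = 0

    def on_cycle(self, got_heartbeat: bool) -> Optional[dict]:
        """Call once per heartbeat cycle; returns a "fault" message when due."""
        if got_heartbeat:
            self.missed = 0
            return None
        self.missed += 1
        if self.missed < self.max_missed:
            return None
        self.missed = 0
        # Message destined for the global monitoring queue on the message router.
        return {"subject": "fault", "sender": self.node, "silent_queue": self.pre_qid}

    def resubscribe(self, pre_qid: str) -> None:
        """Switch to the queue sent back by the monitoring server."""
        self.pre_qid = pre_qid
        self.missed = 0
```

Resetting the counter after a report means that, if the newly assigned predecessor is also silent, the same watcher raises a second "fault" message after another `max_missed` cycles, which matches the repeated detection used for contiguous failures.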

5. The server cooperative monitoring method for large-scale cloud data centers according to claim 1, wherein in step 2, when a contiguous group of data servers fails, the cooperative monitoring method is as follows:

Step 5: The monitoring server must also notify the task dispatcher that DN_i has failed; when new tasks subsequently arrive, the task dispatcher will not assign tasks to DN_i unless DN_i returns to a normal state and rejoins the system;

Step 6: Because DN_{i-1} has also failed, when server DN_{i+1} likewise fails to obtain a message from the heartbeat queue identified by Q_{i-1} for several consecutive cycles, it immediately publishes another message packet with the subject "fault" to the message router; this packet is received by the global monitoring queue on the message router;