A collaborative server monitoring method for large-scale cloud data centers
Technical field
The present invention relates to information technology system management applications, and in particular to a collaborative server monitoring method for large-scale cloud data centers.
Background technology
Cloud computing, built on centralized cloud data centers, provides users with dynamic, cost-effective, and elastically scalable computing, storage, and information services. It has changed the architecture and operating mode of the traditional information technology industry and currently receives great attention from academia and industry at home and abroad. Major national governments and influential enterprises and institutions have rushed to build large-scale cloud data centers; Google, Baidu, IBM, Microsoft, Yahoo, Amazon, VMware, Salesforce, Huawei, and others have all proposed their own cloud computing solutions; and widely used network systems such as Facebook, YouTube, Taobao, Wanwang, and Sina are all based on cloud computing platforms.
The data servers in a cloud data center are the physical basis that actually carries all resources, and their normal operation is the prerequisite for a cloud computing system to provide stable and efficient services. An efficient server monitoring mechanism is therefore vital to a cloud computing system. Current cloud computing monitoring and management systems focus on monitoring virtual machine resources and behavior, while the servers themselves are monitored with a simple centralized architecture using heartbeats or polling. For example, Google's cloud computing system uses one or several master servers to monitor the state of each data server in the data center's server cluster; IBM's "Blue Cloud" platform uses the Tivoli monitoring software to monitor the data center's servers and the execution of tasks, also with a centralized architecture; and Nagios, a monitoring system widely used for the hosts and network state of cloud computing systems, likewise adopts a centralized architecture. The advantages of a centralized monitoring architecture are strong controllability, flexibility, and easy maintenance; its defects are performance bottlenecks and the single-point-of-failure problem.
In small and medium-scale data centers, a centralized monitoring architecture in which each data server, as a worker, periodically sends heartbeat messages to the monitoring server to report its current working state, so that failed servers are not missed, is feasible. In a large-scale cloud data center, however, a simple heartbeat mechanism is clearly unworkable: a huge number of data servers all sending periodic heartbeat messages to the monitoring server imposes a large extra network communication burden, easily exhausts the monitoring server's system and network resources, causes performance bottlenecks and monitoring-server failures, and even produces an effect similar to a distributed denial-of-service attack.
To address this, the approach currently adopted is to deploy a monitoring server with high performance and high availability, supplemented by functional modules such as log-based recovery or dual-machine hot backup. This raises system cost and does not solve the problem at its root.
Aiming at these problems of current cloud data center monitoring systems, the present invention provides a collaborative server monitoring method for large-scale cloud data centers. Data servers perceive and monitor one another, replacing the centralized monitoring architecture; this improves the servers' capability for self-management, effectively relieves the monitoring server's burden, and eliminates the performance bottleneck and the risk of monitoring-server failure.
Summary of the invention
To solve the above technical problems, the invention provides a collaborative server monitoring method for large-scale cloud data centers, with the following technical scheme:
The collaborative server monitoring method for large-scale cloud data centers is realized on a collaborative server monitoring model whose main components are a monitoring server, a message router, data servers, message queues, a monitoring routing table, and daemons. The collaborative monitoring method comprises the following steps:
Step 1: all data servers are connected in sequence to form a unidirectional ring topology. Each data server has a predecessor server and a successor server and is monitored by its successor; when a data server fails, its successor is responsible for reporting the failure to the monitoring server;
Step 2: when a data server joins the system, the collaborative monitoring method is: rebuild the unidirectional ring topology to include the new data server, and the monitoring server announces the new server's joining to the task scheduler;
When a single data server fails, the collaborative monitoring method is: the failed server's successor is responsible for detecting the failure and reporting it to the monitoring server; the unidirectional ring topology is rebuilt to exclude the failed data server; the monitoring server announces the failure to the task scheduler; and monitoring continues;
When a contiguous block of data servers fails, the collaborative monitoring method is: the first normal successor after the block of failed servers detects the failures one by one and reports each failed data server to the monitoring server in turn; the unidirectional ring topology is rebuilt to exclude the failed data servers; the monitoring server announces each failure to the task scheduler in turn; and monitoring continues.
In step 2, when a data server joins the system, the collaborative monitoring method is as follows:
Step 1: the system first judges whether the data server currently joining is joining for the first time or rejoining. If the data server joins for the first time, it connects to the message router through its daemon and requests the message router to create an independent heartbeat queue for it; thus if N data servers have ever joined the system, there are N heartbeat queues on the message router regardless of whether those servers are currently online, and in subsequent phases each server periodically publishes messages to its own heartbeat queue. If the data server is rejoining, the heartbeat queue created when it first joined already exists on the message router and need not be recreated;
Step 2: the data server actively reports to the monitoring server by publishing a message packet with the topic "login" to the message router; the packet is inserted into the global monitoring queue on the message router;
Step 3: the monitoring server subscribed to the global monitoring queue at initialization. When it obtains a "login" packet, it immediately extracts from the packet the information of the server to be monitored (NID, IP, QID), inserts this information into the monitoring routing table it maintains locally, modifies the record of that server's successor in the monitoring routing table, and sends the relevant information to that successor, so as to rebuild the unidirectional ring network topology.
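Steps 1 and 2 above can be sketched as follows. This is a minimal in-memory illustration only: the real message router would be a standalone broker, and all class and queue names here (`MessageRouter`, `GLOBAL`, `Q_DN1`) are assumptions, not terms from the invention.

```python
from collections import defaultdict, deque

class MessageRouter:
    """Minimal in-memory stand-in for the message router: named queues
    that servers publish to and subscribers drain."""
    def __init__(self):
        self.queues = defaultdict(deque)

    def create_queue(self, qid):
        self.queues[qid]  # defaultdict creates the queue on first touch

    def publish(self, qid, packet):
        self.queues[qid].append(packet)

    def drain(self, qid):
        q = self.queues[qid]
        items = list(q)
        q.clear()
        return items

class DataServer:
    def __init__(self, nid, ip, router):
        self.nid, self.ip = nid, ip
        self.qid = "Q_" + nid          # each server owns one heartbeat queue
        self.router = router

    def join(self):
        # First join: ask the router for a private heartbeat queue;
        # on a rejoin the queue already exists and is simply reused.
        self.router.create_queue(self.qid)
        # Actively report to the monitoring server via the global queue.
        self.router.publish("GLOBAL", {"topic": "login",
                                       "NID": self.nid, "IP": self.ip,
                                       "QID": self.qid})

    def heartbeat(self):
        self.router.publish(self.qid, {"topic": "heartbeat", "NID": self.nid})

router = MessageRouter()
dn1 = DataServer("DN1", "198.1.1.1", router)
dn1.join()
dn1.heartbeat()
print([p["topic"] for p in router.drain("GLOBAL")])   # -> ['login']
print(len(router.drain("Q_DN1")))                     # -> 1
```

The monitoring server would consume the "login" packet from the global queue, while the joining server's successor would subscribe to the new heartbeat queue, as detailed below.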
The modification method in step 3 is specifically:

Step (1): suppose data server DNm+1 joins the system. According to the information in its "login" packet, the monitoring server appends a new record for DNm+1 at the end of the monitoring routing table and at the same time modifies the information in DN1's record, as shown in table 2. Adding data server DNm+1's information means that DNm+1 is inserted between DN1 and DNm, so DN1's original predecessor pointer (PreNode, PreQID) information (DNm, Qm) is inserted into the predecessor-pointer fields of DNm+1's record, and DN1's predecessor pointer is then modified to (DNm+1, Qm+1).

Step (2): the monitoring server sends the predecessor pointer (DNm, Qm) in DNm+1's record to server DNm+1 according to DNm+1's IP address, and also sends the modified predecessor pointer information (DNm+1, Qm+1) in DN1's record to server DN1 according to DN1's IP address;

Step (3): server DNm+1 applies to the message router to subscribe to the heartbeat queue identified by Qm, and server DN1 applies to subscribe to the heartbeat queue identified by Qm+1. The network topology has now been rebuilt: the original monitoring relation between DN1 and DNm is replaced by monitoring relations between DN1 and DNm+1, and between DNm+1 and DNm.

Step 4: the monitoring server also announces DNm+1's joining to the task scheduler, so that when new tasks arrive the task scheduler can choose to allocate tasks to DNm+1.
When a single data server fails in step 2, the collaborative monitoring method is as follows:

Step 1: if server DNi+1 obtains no message from the heartbeat queue identified by Qi for several consecutive heartbeat cycles (e.g., 3 consecutive cycles), it immediately publishes a message packet with the topic "fault" to the message router; the packet is received into the global monitoring queue on the message router;

Step 2: when the monitoring server obtains the "fault" packet from the global monitoring queue, it extracts from the packet that the server reporting the fault is DNi+1, and then finds in the local monitoring routing table that DNi+1's monitored object (i.e., DNi+1's predecessor server) is DNi;

Step 3: the monitoring server judges that DNi has failed and updates the monitoring routing table: it first extracts the (PreNode, PreQID) information (DNi-1, Qi-1) from DNi's record to update the (PreNode, PreQID) information in DNi+1's record, then deletes DNi's record;

Step 4: according to DNi+1's IP address, the monitoring server sends the updated (PreNode, PreQID) information to server DNi+1, and DNi+1 applies to the message router to subscribe to the heartbeat queue identified by Qi-1. DNi+1 and DNi-1 have now established a monitoring relation, and the unidirectional ring network topology has been rebuilt.

Step 5: the monitoring server also announces DNi's failure to the task scheduler; when new tasks arrive, the task scheduler will not allocate tasks to DNi unless DNi returns to a normal state and rejoins the system.
In step 2, when a contiguous block of data servers fails, the collaborative monitoring method is as follows:

Step 1: server DNi+1 obtains no message from the heartbeat queue identified by Qi for several consecutive cycles and immediately publishes a message packet with the topic "fault" to the message router; the packet is received into the global monitoring queue on the message router;

Step 2: when the monitoring server obtains the "fault" packet from the global monitoring queue, it extracts that the reporting server is DNi+1, finds in the local monitoring routing table that DNi+1's monitored object is DNi, and judges that DNi has failed;

Step 3: the monitoring server updates the monitoring routing table: it first extracts the (PreNode, PreQID) information (DNi-1, Qi-1) from DNi's record to update the (PreNode, PreQID) information in DNi+1's record, then deletes DNi's record;

Step 4: according to DNi+1's IP address, the monitoring server sends the updated (PreNode, PreQID) information to server DNi+1, which applies to the message router to subscribe to the heartbeat queue identified by Qi-1; DNi+1 and DNi-1 establish a monitoring relation;

Step 5: the monitoring server also announces DNi's failure to the task scheduler; when new tasks arrive, the task scheduler will not allocate tasks to DNi unless DNi recovers and rejoins the system.

Step 6: since DNi-1 has also failed, server DNi+1 likewise obtains no message from the heartbeat queue identified by Qi-1 for several consecutive cycles and immediately publishes another "fault" packet to the message router; the packet is received into the global monitoring queue on the message router;

Step 7: when the monitoring server obtains this "fault" packet from the global monitoring queue, it extracts that the reporting server is DNi+1, finds in the local monitoring routing table that DNi+1's monitored object is DNi-1, and judges that DNi-1 has failed;

Step 8: the monitoring server updates the monitoring routing table: it extracts the (PreNode, PreQID) information (DNi-2, Qi-2) from DNi-1's record to update the (PreNode, PreQID) information in DNi+1's record, then deletes DNi-1's record;

Step 9: according to DNi+1's IP address, the monitoring server sends the updated (PreNode, PreQID) information to server DNi+1, which applies to subscribe to the heartbeat queue identified by Qi-2; DNi+1 and DNi-2 establish a monitoring relation, and the unidirectional ring network topology is rebuilt again;

Step 10: the monitoring server also announces DNi-1's failure to the task scheduler; when new tasks arrive, the task scheduler will not allocate tasks to DNi-1 unless DNi-1 recovers and rejoins the system.

When more data servers have failed, the same flow as steps 6 to 9 is repeated to discover all failed data servers in turn.
The collaborative server monitoring method for large-scale cloud data centers proposed by the present invention achieves the following beneficial effects:

(1) Response time. In a small-scale cloud data center, with the same monitoring interval, interval count, and server failure rate, the response time of the collaborative monitoring mechanism is close to that of the centralized mechanism: each time a server fails, the collaborative mechanism needs one extra communication compared with the centralized mechanism, but the monitoring server in the centralized mechanism has to process far more server heartbeat messages. In a large-scale cloud data center, the response time of the collaborative mechanism is clearly smaller than that of the centralized mechanism.

(2) Load balancing. The collaborative mechanism distributes the inter-server monitoring tasks across the data servers; the monitoring server receives information only when a data server first joins, rejoins, or fails. The large volume of normal-server information that the monitoring server originally had to process is now received and processed by the data servers themselves, effectively achieving load balancing.

(3) Update overhead. When a data server joins the system or fails, the monitoring system's network topology changes, and the information on some servers must be updated to rebuild it. But each join or failure affects only the monitoring server and the two adjacent data servers, and in essence only the data server responsible for monitoring the newly joined or failed server; modifying the monitoring routing table touches only two records while the others are unaffected, so the update overhead is very low.

(4) Detection efficiency. Whether data servers fail sparsely or in contiguous blocks, the collaborative mechanism can effectively detect all failed servers. Although detecting all failed servers takes longer when a block of data servers fails, the probability of such block failures is very low, so the impact on overall system performance is limited.
Brief description of the drawings
Fig. 1 is a diagram of the collaborative monitoring network topology.
Fig. 2 is a diagram of the collaborative server monitoring model.
Embodiment
Even though current cloud computing systems use cloud data centers built from low-cost servers, within any period of time the proportion of failed servers among the total remains low, so reporting "abnormal" rather than reporting "normal" clearly reduces the number of communications. The collaborative server monitoring method for large-scale cloud data centers proposed by the present invention follows the principle that "if a data server does not actively send information to the monitoring server, the data server is assumed to be normal": a data server actively sends information to the monitoring server only when an abnormality occurs. The problem is that when a server itself becomes abnormal, it loses the ability to send abnormality information to the master server; the collaborative monitoring mechanism therefore relies on mutual perception and mutual supervision between data servers, and a server's abnormality information is reported to the master server by its neighbor server.
1. Unidirectional ring topology

First, the server cluster in the cloud computing system forms a unidirectional ring topology for collaborative monitoring: the servers are connected end to end, and the monitoring between servers is one-way, as shown in Fig. 1. DN1's predecessor server (PreNode) is DN8 and its successor server (PostNode) is DN2; DN1 is supervised only by DN2 and is responsible for monitoring only the state of DN8. If DN1 fails, DN2 is responsible for reporting DN1's failure to the monitoring server MN.
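The ring relationship above can be sketched in a few lines; this is an illustrative Python fragment (the eight-node example mirrors Fig. 1, and the function name is an assumption):

```python
def build_ring(servers):
    """Link servers into a unidirectional ring for collaborative
    monitoring: each node watches its predecessor and is watched by
    its successor."""
    ring = {}
    n = len(servers)
    for k, name in enumerate(servers):
        ring[name] = {"PreNode": servers[k - 1],          # whom I monitor
                      "PostNode": servers[(k + 1) % n]}   # who monitors me
    return ring

ring = build_ring(["DN%d" % i for i in range(1, 9)])
print(ring["DN1"])  # -> {'PreNode': 'DN8', 'PostNode': 'DN2'}
```

Because every node is watched by exactly one successor, each heartbeat queue has exactly one subscriber, which is what keeps the monitoring load constant per server regardless of cluster size.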
The collaborative server monitoring method for large-scale cloud data centers proposed by the present invention comprises the following links:

(1) A server newly joining the system establishes predecessor and successor relations with the other servers: it selects the server it is responsible for monitoring and the server that will monitor it, and obtains information such as the IP addresses of its predecessor and successor servers. For example, DN1 knows that its predecessor server is DN8 and its successor server is DN2, and no other server.

(2) When a server fails, the original network topology is broken, and the method rebuilds the complete ring topology quickly and at low overhead. For instance, when DN1 goes down, DN2 and DN8 quickly establish a collaborative monitoring relation, and servers in the system with no direct relation to DN1 are unaffected.

(3) When a contiguous block of servers fails almost simultaneously, that is, when a server and the server responsible for monitoring it both fail, the monitoring server quickly obtains all the server failure information and the heavily damaged network topology is quickly rebuilt. For example, when DN1 and DN2 fail simultaneously, the abnormality information of both DN1 and DN2 is quickly reported to the monitoring server.
2. Collaborative monitoring model

To solve the above problems, a collaborative server monitoring model suitable for large-scale cloud computing systems must first be built, as shown in Fig. 2. It mainly involves the following functional components:

(1) Monitoring server. The monitoring server is responsible for monitoring the running state of all servers in the system. The server state information it collects is available to the cloud system administrator for quickly grasping the global situation; it is used to maintain the collaborative monitoring relations between servers and safeguard the system's normal operation; and it is offered to other modules in the system (such as the task scheduler) as a basis for selecting workers or service providers. In particular, information about failed servers is sent to modules such as the task scheduler in time.

(2) Message router. The message router is the core component for exchanging information among all servers and modules in the system; it adopts an advanced message queuing protocol and can maintain one or more message queues.

(3) Message queue. Each message queue is a named buffer residing on the message router; each queue has a unique queue identifier QID, and each message has a topic. A server can bind to a specific message queue. The collaborative monitoring system involves two classes of message queues: heartbeat queues and the global monitoring queue. Each server has its own heartbeat queue, to which it periodically publishes heartbeat messages and from which other servers can subscribe to messages. There is only one global monitoring queue in the whole system; it carries message packets of two topics, "login" and "fault". A "login" message contains the identifier NID, IP address, and heartbeat queue identifier QID of the server sending the message; a "fault" message contains the identifier NID of the server sending the message.

(4) Monitoring routing table. The monitoring routing table resides on the monitoring server and is in essence a one-way circular linked list. Each record in the table can be described by a 5-tuple (NID, IP, QID, PreNode, PreQID), where NID is the server identifier, IP is the server's IP address, QID is the identifier of the server's heartbeat queue, PreNode is the current server's predecessor server, and PreQID is the identifier of the predecessor server's heartbeat queue. A typical monitoring routing table is shown in table 1.
Table 1. Monitoring routing table

| NID | IP | QID | PreNode | PreQID |
| DN1 | 198.1.1.1 | Q1 | DNm | Qm |
| DN2 | 198.1.1.2 | Q2 | DN1 | Q1 |
| …… | …… | …… | …… | …… |
| DNi-1 | 198.1.1.(i-1) | Qi-1 | DNi-2 | Qi-2 |
| DNi | 198.1.1.i | Qi | DNi-1 | Qi-1 |
| DNi+1 | 198.1.1.(i+1) | Qi+1 | DNi | Qi |
| …… | …… | …… | …… | …… |
| DNm | 198.1.1.m | Qm | DNm-1 | Qm-1 |
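The 5-tuple records above can be modeled directly. The sketch below is an illustrative Python rendering (the function name and dict layout are assumptions); keeping the table as a dict keyed by NID mirrors the one-way circular-linked-list structure:

```python
# Monitoring routing table: one record per server, keyed by NID.
# Each record holds the 5-tuple fields (IP, QID, PreNode, PreQID);
# the PreNode fields chain the records into a one-way circular list.
def make_table(m):
    table = {}
    for i in range(1, m + 1):
        pre = m if i == 1 else i - 1        # DN1's predecessor is DNm
        table["DN%d" % i] = {"IP": "198.1.1.%d" % i,
                             "QID": "Q%d" % i,
                             "PreNode": "DN%d" % pre,
                             "PreQID": "Q%d" % pre}
    return table

table = make_table(8)
print(table["DN1"]["PreNode"], table["DN2"]["PreNode"])  # -> DN8 DN1
```

Following the PreNode fields from any record eventually returns to the starting record, which is how the table encodes the ring of table 1.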
(5) Daemon. A daemon resides on each server and is responsible for interacting with the other components on the server's behalf; in the monitoring system, it mainly publishes data to specific message queues or subscribes to the needed messages from specific message queues.
3. Collaborative monitoring flow when a data server joins the system

The collaborative monitoring flow when a data server joins the system is as follows:

Step 1: the system first judges whether the data server currently joining is joining for the first time or rejoining. If it joins for the first time, it connects to the message router through its daemon and requests the router to create an independent heartbeat queue for it, after which it periodically publishes heartbeat messages to this queue; thus if N servers have ever joined the system, there are N heartbeat queues on the message router regardless of whether those servers are currently online, and in subsequent phases each server periodically publishes to its own heartbeat queue. If the server is rejoining, the heartbeat queue created at its first join already exists on the router and need not be recreated; the server simply resumes publishing heartbeat messages to its own queue;

Step 2: the data server actively reports to the monitoring server by publishing a message packet with the topic "login" to the message router; the packet is received into the global monitoring queue on the router;

Step 3: the monitoring server subscribed to the global monitoring queue at initialization. When it obtains a "login" packet, it immediately extracts the information of the server to be monitored (NID, IP, QID) from the packet, inserts it into the locally maintained monitoring routing table, modifies the record of that server's successor, and sends the relevant information to the corresponding servers so as to rebuild the unidirectional ring network topology.
The concrete modification method is as follows:

Step (1): suppose data server DNm+1 joins the system. According to the information in its "login" packet, the monitoring server appends a new record for DNm+1 at the end of the monitoring routing table and at the same time modifies DN1's record, as shown in table 2. Adding DNm+1's information means that DNm+1 is inserted between DN1 and DNm, so DN1's original predecessor pointer (PreNode, PreQID), namely server DNm's identifier (written simply as DNm; the identifiers of the other servers are written the same way below) and the corresponding heartbeat queue identifier Qm, is inserted into the predecessor-pointer fields of DNm+1's record, and DN1's predecessor pointer is then modified to (DNm+1, Qm+1).
Table 2. Monitoring routing table with the newly added DNm+1 record

| NID | IP | QID | PreNode | PreQID |
| DN1 | 198.1.1.1 | Q1 | DNm+1 | Qm+1 |
| DN2 | 198.1.1.2 | Q2 | DN1 | Q1 |
| …… | …… | …… | …… | …… |
| DNi-1 | 198.1.1.(i-1) | Qi-1 | DNi-2 | Qi-2 |
| DNi | 198.1.1.i | Qi | DNi-1 | Qi-1 |
| DNi+1 | 198.1.1.(i+1) | Qi+1 | DNi | Qi |
| …… | …… | …… | …… | …… |
| DNm | 198.1.1.m | Qm | DNm-1 | Qm-1 |
| DNm+1 | 198.1.1.(m+1) | Qm+1 | DNm | Qm |
Step (2): the monitoring server sends the predecessor pointer (DNm, Qm) in DNm+1's record to server DNm+1 according to DNm+1's IP address, and also sends the modified predecessor pointer information (DNm+1, Qm+1) in DN1's record to server DN1 according to DN1's IP address;

Step (3): server DNm+1 applies to the message router to subscribe to the heartbeat queue identified by Qm, and server DN1 applies to subscribe to the heartbeat queue identified by Qm+1. The network topology has now been rebuilt: the original monitoring relation between DN1 and DNm is replaced by monitoring relations between DN1 and DNm+1, and between DNm+1 and DNm.

Step 4: the monitoring server also announces DNm+1's joining to the task scheduler, so that when new tasks arrive the task scheduler can choose to allocate tasks to DNm+1.
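Steps (1)–(3) amount to a constant-time insertion into the circular table, touching exactly two records. A hedged sketch (illustrative Python; the function name and record layout are assumptions consistent with the 5-tuple described earlier):

```python
def join(table, new_nid, new_ip, new_qid, head="DN1"):
    """Insert a newly joined server between the head record's old
    predecessor and the head itself, modifying only two records."""
    old_pre, old_preq = table[head]["PreNode"], table[head]["PreQID"]
    # The new record inherits the head's old predecessor pointer ...
    table[new_nid] = {"IP": new_ip, "QID": new_qid,
                      "PreNode": old_pre, "PreQID": old_preq}
    # ... and the head now points at the newcomer.
    table[head]["PreNode"], table[head]["PreQID"] = new_nid, new_qid

table = {"DN1": {"IP": "198.1.1.1", "QID": "Q1",
                 "PreNode": "DN3", "PreQID": "Q3"},
         "DN2": {"IP": "198.1.1.2", "QID": "Q2",
                 "PreNode": "DN1", "PreQID": "Q1"},
         "DN3": {"IP": "198.1.1.3", "QID": "Q3",
                 "PreNode": "DN2", "PreQID": "Q2"}}
join(table, "DN4", "198.1.1.4", "Q4")
print(table["DN4"]["PreNode"], table["DN1"]["PreNode"])  # -> DN3 DN4
```

After the table update, the monitoring server would push the two changed predecessor pointers to the affected servers so they can resubscribe, exactly as steps (2) and (3) describe.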
4. Collaborative monitoring flow when a single data server fails

The collaborative monitoring flow when a single or sparsely distributed server fails (illustrated with the failure of server DNi, detected by its successor DNi+1) is as follows:

Step 1: if server DNi+1 obtains no message from the heartbeat queue identified by Qi for several consecutive heartbeat cycles (e.g., 3 consecutive cycles), it immediately publishes a message packet with the topic "fault" to the message router; the packet is received into the global monitoring queue on the router;

Step 2: when the monitoring server obtains the "fault" packet from the global monitoring queue, it extracts from the packet that the reporting server is DNi+1, and then finds in the local monitoring routing table that DNi+1's monitored object (its predecessor server) is DNi, as shown in table 3.

Table 3. Monitoring routing table after DNi fails

Step 3: the monitoring server judges that DNi has failed and updates the monitoring routing table: it first extracts the (PreNode, PreQID) information from DNi's record, namely server DNi-1's identifier and the corresponding heartbeat queue identifier Qi-1, to update the (PreNode, PreQID) information in DNi+1's record, then deletes DNi's record;

Step 4: according to DNi+1's IP address, the monitoring server sends the updated (PreNode, PreQID) information to server DNi+1, which applies to the message router to subscribe to the heartbeat queue identified by Qi-1. DNi+1 and DNi-1 have now established a monitoring relation, and the unidirectional ring network topology has been rebuilt.

Step 5: the monitoring server also announces DNi's failure to the task scheduler; when new tasks arrive, the task scheduler will not allocate tasks to DNi unless DNi recovers and rejoins the system.
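The routing-table side of this flow (steps 3 and 4) can be sketched as a single splice operation. Illustrative Python; `handle_fault` is an assumed name and the records carry only the pointer fields relevant here:

```python
def handle_fault(table, reporter):
    """On a 'fault' packet from `reporter`, look up its monitored
    predecessor, splice that record out of the circular table, and
    return the new (PreNode, PreQID) the reporter must subscribe to."""
    failed = table[reporter]["PreNode"]      # reporter monitors its predecessor
    new_pre = table[failed]["PreNode"]
    new_preq = table[failed]["PreQID"]
    table[reporter]["PreNode"] = new_pre     # reporter inherits the
    table[reporter]["PreQID"] = new_preq     # failed node's pointer
    del table[failed]                        # drop the failed record
    return failed, (new_pre, new_preq)

table = {"DN1": {"PreNode": "DN3", "PreQID": "Q3"},
         "DN2": {"PreNode": "DN1", "PreQID": "Q1"},
         "DN3": {"PreNode": "DN2", "PreQID": "Q2"}}
# DN3 stops hearing DN2's heartbeats and reports a fault:
failed, new_sub = handle_fault(table, "DN3")
print(failed, new_sub)  # -> DN2 ('DN1', 'Q1')
```

The returned pointer is what the monitoring server sends back to the reporter so it can subscribe to its new predecessor's heartbeat queue and close the ring.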
5. Collaborative monitoring flow when data servers fail in blocks

In current cloud data center environments, the probability of a contiguous run of servers failing is very low, but it can still happen, so this situation must be handled. As shown in table 4, suppose servers DNi and DNi-1 fail almost simultaneously. This means that DNi, which is responsible for monitoring DNi-1, can no longer let the monitoring server learn of DNi-1's failure by publishing a "fault" packet to the message router. But DNi+1 has not failed, so DNi+1 is still able to monitor DNi.

Table 4. Monitoring routing table after servers DNi and DNi-1 fail consecutively
| NID | IP | QID | PreNode | PreQID |
| DN1 | 198.1.1.1 | Q1 | DNm+1 | Qm+1 |
| DN2 | 198.1.1.2 | Q2 | DN1 | Q1 |
| …… | …… | …… | …… | …… |
| DNi-2 | 198.1.1.(i-2) | Qi-2 | DNi-3 | Qi-3 |
| DNi-1 | 198.1.1.(i-1) | Qi-1 | DNi-2 | Qi-2 |
| DNi | 198.1.1.i | Qi | DNi-1 | Qi-1 |
| DNi+1 | 198.1.1.(i+1) | Qi+1 | DNi | Qi |
| …… | …… | …… | …… | …… |
| DNm | 198.1.1.m | Qm | DNm-1 | Qm-1 |
| DNm+1 | 198.1.1.(m+1) | Qm+1 | DNm | Qm |
The collaborative monitoring flow when servers fail consecutively in a block is as follows:

Step 1: server DNi+1 obtains no message from the heartbeat queue identified by Qi for several consecutive cycles and immediately publishes a message packet with the topic "fault" to the message router; the packet is received into the global monitoring queue;

Step 2: when the monitoring server obtains the "fault" packet from the global monitoring queue, it extracts that the reporting server is DNi+1, finds in the local monitoring routing table that DNi+1's monitored object is DNi, and judges that DNi has failed;

Step 3: the monitoring server updates the monitoring routing table: it first extracts the (PreNode, PreQID) information from DNi's record, namely server DNi-1's identifier and the corresponding heartbeat queue identifier Qi-1, to update the (PreNode, PreQID) information in DNi+1's record, then deletes DNi's record;

Step 4: according to DNi+1's IP address, the monitoring server sends the updated (PreNode, PreQID) information to server DNi+1, which applies to the message router to subscribe to the heartbeat queue identified by Qi-1; DNi+1 and DNi-1 establish a monitoring relation, as shown in table 5;

Table 5. Monitoring routing table after the failed server DNi's record is deleted

Step 5: the monitoring server also announces DNi's failure to the task scheduler; when new tasks arrive, the task scheduler will not allocate tasks to DNi unless DNi recovers and rejoins the system.
Step 6: since DNi-1 has also failed, server DNi+1 likewise obtains no message from the heartbeat queue identified by Qi-1 for several consecutive cycles and immediately publishes another "fault" packet to the message router; the packet is received into the global monitoring queue;

Step 7: when the monitoring server obtains this "fault" packet from the global monitoring queue, it extracts that the reporting server is DNi+1, finds in the local monitoring routing table that DNi+1's monitored object is DNi-1, and judges that DNi-1 has failed;

Step 8: the monitoring server updates the monitoring routing table: it extracts the (PreNode, PreQID) information from DNi-1's record, namely server DNi-2's identifier and the corresponding heartbeat queue identifier Qi-2, to update the (PreNode, PreQID) information in DNi+1's record, then deletes DNi-1's record;

Step 9: according to DNi+1's IP address, the monitoring server sends the updated (PreNode, PreQID) information to server DNi+1, which applies to subscribe to the heartbeat queue identified by Qi-2; DNi+1 and DNi-2 establish a monitoring relation, and the unidirectional ring network topology is rebuilt again, as shown in table 6.

Table 6. Monitoring routing table after the failed server DNi-1's record is deleted

Step 10: the monitoring server also announces DNi-1's failure to the task scheduler; when new tasks arrive, the task scheduler will not allocate tasks to DNi-1 unless DNi-1 recovers and rejoins the system.

When more data servers have failed in a block, the same flow as steps 6 to 9 is repeated to discover all failed data servers in turn.
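Steps 6 to 9 simply repeat the single-failure splice until the surviving reporter points at a live predecessor. A hedged sketch (illustrative Python; `alive` stands in for the heartbeat-timeout check, and all names are assumptions):

```python
def detect_block_failure(table, reporter, alive):
    """Repeat the fault-report/splice cycle: keep removing the
    reporter's predecessor until it points at a live server.
    Returns the failed servers in the order they were discovered."""
    failed = []
    while not alive(table[reporter]["PreNode"]):
        dead = table[reporter]["PreNode"]
        # The reporter inherits the dead node's predecessor pointer,
        # then the dead record is deleted (steps 3-4 and 8-9 above).
        table[reporter]["PreNode"] = table[dead]["PreNode"]
        table[reporter]["PreQID"] = table[dead]["PreQID"]
        del table[dead]
        failed.append(dead)
    return failed

# Five-node ring DN1..DN5, with DN3 and DN4 failing together.
table = {"DN%d" % i: {"PreNode": "DN%d" % (((i - 2) % 5) + 1),
                      "PreQID": "Q%d" % (((i - 2) % 5) + 1)}
         for i in range(1, 6)}
down = {"DN3", "DN4"}
found = detect_block_failure(table, "DN5", lambda n: n not in down)
print(found, table["DN5"]["PreNode"])  # -> ['DN4', 'DN3'] DN2
```

Note the discovery order matches the flow above: the reporter finds its immediate predecessor first, then walks backward through the failed block, one heartbeat-timeout round per failed server.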