CN103944784A - Large-scale-cloud-data-center-oriented server cooperative monitoring method - Google Patents

Large-scale-cloud-data-center-oriented server cooperative monitoring method

Info

Publication number
CN103944784A
Authority
CN
China
Prior art keywords
server
monitoring
message
data
queue
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410166275.2A
Other languages
Chinese (zh)
Other versions
CN103944784B (en)
Inventor
徐小龙
杨冠
章韵
李嘉豪
张凯
李爱群
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Weihai Blue Ocean Bank Co., Ltd
Original Assignee
Nanjing Post and Telecommunication University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Post and Telecommunication University
Priority to CN201410166275.2A
Publication of CN103944784A
Application granted
Publication of CN103944784B
Legal status: Active
Anticipated expiration

Landscapes

  • Computer And Data Communications (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a server cooperative monitoring method for large-scale cloud data centers. Servers sense and monitor one another in place of a centralized monitoring structure, which improves the self-management capability of the servers, effectively reduces the monitoring burden on the monitoring server, and removes performance bottlenecks and single-point-of-failure risks. A cooperative monitoring mechanism and its functional components are provided, together with the working steps of the mechanism when a data server joins the system, when a single data server fails, and when a group of data servers fails. When the method is used in a large-scale cloud data center, the system response time is markedly shorter than that of a centralized monitoring mechanism, load balancing is effectively achieved, the update overhead is low, and all failed data servers can be detected effectively whether they fail in a discrete manner or as a group.

Description

A server cooperative monitoring method for large-scale cloud data centers
Technical field
The present invention relates to system management applications in information technology, and in particular to a server cooperative monitoring method for large-scale cloud data centers.
Background art
Cloud computing, built on centrally constructed cloud data centers, provides users with dynamic, cost-effective, elastically scalable computing, storage and other information services. It has changed the architecture and operating model of the traditional information technology industry and has attracted great attention from academia and industry at home and abroad. Governments of major countries and influential enterprises and institutions are building large-scale cloud data centers; Google, Baidu, IBM, Microsoft, Yahoo, Amazon, VMware, Salesforce, Huawei and others have all proposed their own cloud computing solutions; and widely popular network systems such as Facebook, YouTube, Taobao, Wanwang (HiChina) and Sina are also based on cloud computing platforms.
The data servers in a cloud data center are the physical foundation that actually carries all resources, and their normal operation is a prerequisite for a cloud computing system to provide stable and efficient service. An efficient server monitoring mechanism is therefore essential to a cloud computing system. Current cloud monitoring and management systems focus on monitoring virtual machine resources and behavior, while the servers themselves are monitored simply with a centralized architecture in heartbeat or polling mode. For example, the Google cloud computing system uses one or a few master servers to monitor the state of each data server in the server cluster of the cloud data center. IBM's "Blue Cloud" cloud computing platform uses Tivoli monitoring software to monitor the servers of the cloud data center and the execution of tasks, also with a centralized monitoring architecture. Nagios, a host and network status monitoring system widely used in cloud computing systems, likewise adopts a centralized monitoring architecture. The advantages of a centralized monitoring architecture are strong controllability and flexible, convenient maintenance; its drawbacks are a performance bottleneck and a single point of failure.
In small and medium-sized data centers, a centralized monitoring architecture is feasible: each data server, acting as a task executor, periodically sends heartbeat messages to the monitoring server to report its current working state, which prevents the delays caused by server failures. In a large-scale cloud data center, however, a simple heartbeat mechanism is clearly unsuitable. A huge number of data servers all sending periodic heartbeat messages to the monitoring server imposes a large additional network communication burden and readily consumes a large share of the monitoring server's system and network resources, causing a system performance bottleneck and monitoring server failures, with an effect similar to a distributed denial-of-service attack.
The approach currently adopted to address these problems is to deploy a monitoring server with high performance and high availability, supplemented by functional modules such as log recovery or dual-machine backup. This raises system cost and does not solve the problem at its root.
In view of the problems of current cloud data center monitoring systems, the present invention provides a server cooperative monitoring method for large-scale cloud data centers. Data servers sensing and monitoring one another replaces the centralized monitoring mode, which improves the self-management capability of the servers, effectively reduces the monitoring burden on the monitoring server, and eliminates the performance bottleneck and the risk of monitoring server failure.
Summary of the invention
To solve the above technical problems, the invention provides a server cooperative monitoring method for large-scale cloud data centers, with the following technical scheme:
The server cooperative monitoring method for large-scale cloud data centers is implemented on a server cooperative monitoring model whose main components are a monitoring server, a message router, data servers, message queues, a monitoring routing table and daemon processes. The cooperative monitoring method comprises the following steps:
Step 1: All data servers are connected in sequence to form a unidirectional ring topology. Each data server has a predecessor server and a successor server and is monitored by its successor server. When a data server breaks down and fails, its successor server is responsible for reporting the failure to the monitoring server;
Step 2: When a data server joins the system, the cooperative monitoring method is: the unidirectional ring topology is re-established to include the new data server, and the monitoring server notifies the task scheduler that the new data server has joined the system;
When a single data server breaks down and fails, the cooperative monitoring method is: the successor server of that data server discovers the failure and reports it to the monitoring server; the unidirectional ring topology is re-established without the failed data server; and the monitoring server notifies the task scheduler of the failure and monitoring continues;
When a group of data servers fails, the cooperative monitoring method is: the first normal data server following the group of failed data servers discovers the failures one after another and reports each of them in turn to the monitoring server; the unidirectional ring topology is re-established without the failed data servers; and the monitoring server notifies the task scheduler of each data server failure in turn and monitoring continues.
In Step 2, when a data server joins the system, the cooperative monitoring method is as follows:
Step 1: The system first determines whether the data server is joining the system for the first time or rejoining it. If it is joining for the first time, it connects to the message router through its daemon process and requests the message router to create an independent heartbeat queue for it. If N data servers have ever joined the system, there are N heartbeat queues on the message router, regardless of whether those servers are currently online. In subsequent stages each server periodically publishes messages to its own heartbeat queue. If the data server is rejoining the system, the message router already holds the heartbeat queue created when the server first joined, and the queue need not be created again;
Step 2: The data server actively reports to the monitoring server by publishing a message packet with the topic "login" to the message router; this packet is placed in the global monitoring queue on the message router;
Step 3: The monitoring server subscribed to the global monitoring queue during initialization. When it obtains a "login" message packet, it immediately extracts the information of the server to be monitored (NID, IP, QID), inserts it into the locally maintained monitoring routing table, modifies the record of that server's successor server in the monitoring routing table, and sends the relevant information to the successor server in order to rebuild the unidirectional ring topology.
The modification in Step 3 is performed as follows:
Step (1): Suppose data server DN_{m+1} joins the system. Using the information in its "login" message packet, the monitoring server appends a new record for DN_{m+1} at the end of the monitoring routing table and modifies the corresponding information in the record of DN_1, as shown in Table 2. Adding DN_{m+1} means that DN_{m+1} is inserted between DN_1 and DN_m. The original predecessor pointer (PreNode, PreQID) of DN_1, namely (DN_m, Q_m), is therefore written into the predecessor pointer fields of the DN_{m+1} record, and the predecessor pointer of DN_1 is then changed to (DN_{m+1}, Q_{m+1}).
Step (2): The monitoring server sends the predecessor pointer (DN_m, Q_m) in the DN_{m+1} record to server DN_{m+1}, using the IP address of DN_{m+1}. It also sends the modified predecessor pointer (DN_{m+1}, Q_{m+1}) in the DN_1 record to server DN_1, using the IP address of DN_1;
Step (3): Server DN_{m+1} requests from the message router a subscription to the heartbeat queue identified by Q_m, and server DN_1 requests a subscription to the heartbeat queue identified by Q_{m+1}. The network topology is now rebuilt: the original monitoring relation between DN_1 and DN_m is replaced by a monitoring relation between DN_1 and DN_{m+1} and one between DN_{m+1} and DN_m.
Step 4: The monitoring server also notifies the task scheduler that DN_{m+1} has joined the system, so that when new tasks arrive the task scheduler can choose to assign them to DN_{m+1}.
In Step 2, when a single data server breaks down and fails, the cooperative monitoring method is as follows:
Step 1: If server DN_{i+1} obtains no message from the heartbeat queue identified by Q_i for several consecutive heartbeat cycles (for example 3 consecutive cycles), it immediately publishes a message packet with the topic "fault" to the message router; this packet is received into the global monitoring queue on the message router;
Step 2: When the monitoring server obtains the "fault" message packet from the global monitoring queue, it extracts from the packet that the server sending the fault message is DN_{i+1}, and then looks up the local monitoring routing table to find that the object monitored by DN_{i+1} (that is, the predecessor server of DN_{i+1}) is DN_i;
Step 3: The monitoring server judges that DN_i has failed and updates the monitoring routing table: it first extracts the (PreNode, PreQID) information (DN_{i-1}, Q_{i-1}) from the DN_i record and uses it to update the (PreNode, PreQID) information in the DN_{i+1} record, and then deletes the DN_i record;
Step 4: The monitoring server sends the updated (PreNode, PreQID) information to server DN_{i+1}, using the IP address of DN_{i+1}, and DN_{i+1} requests from the message router a subscription to the heartbeat queue identified by Q_{i-1}. DN_{i+1} and DN_{i-1} have now established a monitoring relation, and the unidirectional ring topology is rebuilt;
Step 5: The monitoring server also notifies the task scheduler of the failure of DN_i. When new tasks arrive, the task scheduler will not assign them to DN_i unless DN_i returns to normal and rejoins the system.
In Step 2, when a group of data servers fails, the cooperative monitoring method is as follows:
Step 1: If server DN_{i+1} obtains no message from the heartbeat queue identified by Q_i for several consecutive cycles, it immediately publishes a message packet with the topic "fault" to the message router; this packet is received into the global monitoring queue on the message router;
Step 2: When the monitoring server obtains the "fault" message packet from the global monitoring queue, it extracts from the packet that the server sending the fault message is DN_{i+1}, looks up the local monitoring routing table to find that the object monitored by DN_{i+1} is DN_i, and judges that DN_i has failed;
Step 3: The monitoring server updates the monitoring routing table: it first extracts the (PreNode, PreQID) information (DN_{i-1}, Q_{i-1}) from the DN_i record and uses it to update the (PreNode, PreQID) information in the DN_{i+1} record, and then deletes the DN_i record;
Step 4: The monitoring server sends the updated (PreNode, PreQID) information to server DN_{i+1}, using the IP address of DN_{i+1}, and DN_{i+1} requests from the message router a subscription to the heartbeat queue identified by Q_{i-1}. DN_{i+1} and DN_{i-1} have now established a monitoring relation;
Step 5: The monitoring server also notifies the task scheduler of the failure of DN_i. When new tasks arrive, the task scheduler will not assign them to DN_i unless DN_i returns to normal and rejoins the system.
Step 6: Because DN_{i-1} has also failed, server DN_{i+1} likewise obtains no message from the heartbeat queue identified by Q_{i-1} for several consecutive cycles, and it immediately publishes another message packet with the topic "fault" to the message router; this packet is received into the global monitoring queue on the message router;
Step 7: When the monitoring server obtains the "fault" message packet from the global monitoring queue, it extracts from the packet that the server sending the fault message is DN_{i+1}, looks up the local monitoring routing table to find that the object monitored by DN_{i+1} is now DN_{i-1}, and judges that DN_{i-1} has failed;
Step 8: The monitoring server updates the monitoring routing table: it first extracts the (PreNode, PreQID) information (DN_{i-2}, Q_{i-2}) from the DN_{i-1} record and uses it to update the (PreNode, PreQID) information in the DN_{i+1} record, and then deletes the DN_{i-1} record;
Step 9: The monitoring server sends the updated (PreNode, PreQID) information to server DN_{i+1}, using the IP address of DN_{i+1}, and DN_{i+1} requests from the message router a subscription to the heartbeat queue identified by Q_{i-2}. DN_{i+1} and DN_{i-2} have now established a monitoring relation, and the unidirectional ring topology is rebuilt once more;
Step 10: The monitoring server also notifies the task scheduler of the failure of DN_{i-1}. When new tasks arrive, the task scheduler will not assign them to DN_{i-1} unless DN_{i-1} returns to normal and rejoins the system.
When more data servers have failed, the same procedure as Steps 6 to 9 is used to discover all failed data servers one after another.
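The repeated report-and-relink cycle for a contiguous group of failed servers can be sketched as a small simulation. This is only an illustration under stated assumptions: the function name `detect_group_failure`, the in-memory `pre` dictionary standing in for the PreNode column of the monitoring routing table, and the server names are all hypothetical, and heartbeat timing is abstracted away.

```python
def detect_group_failure(ring, dead):
    """Return the (reporter, failed) pairs in the order they would be reported.

    ring: list of server names in ring order; dead: set of failed servers.
    """
    # PreNode pointers: each server monitors the server before it in the ring.
    pre = {ring[i]: ring[i - 1] for i in range(len(ring))}
    reports = []
    for node in ring:
        if node in dead:
            continue  # a failed server cannot report anything
        # A live server keeps reporting until its predecessor is alive again.
        while pre[node] in dead:
            failed = pre[node]
            reports.append((node, failed))
            # Inherit the failed server's PreNode and delete its record.
            pre[node] = pre.pop(failed)
    return reports


ring = [f"DN{k}" for k in range(1, 9)]  # DN1 .. DN8
reports = detect_group_failure(ring, {"DN4", "DN5"})
print(reports)  # [('DN6', 'DN5'), ('DN6', 'DN4')]
```

In this run DN6 plays the role of DN_{i+1}: it first reports DN5 (DN_i), is relinked to DN4 (DN_{i-1}), and then reports DN4 as well, which mirrors Steps 1 to 10 above.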
The server cooperative monitoring method for large-scale cloud data centers proposed by the present invention achieves the following beneficial effects:
(1) Response time. In a small-scale cloud data center, given the same monitoring interval, the same number of intervals and the same server failure rate, the response time of the cooperative monitoring mechanism is close to that of a centralized monitoring mechanism: each time a server fails, the cooperative mechanism needs one more communication than the centralized mechanism, but the monitoring server in the centralized mechanism has to process more server heartbeat messages. In a large-scale cloud data center, the response time of the cooperative monitoring mechanism is markedly shorter than that of the centralized monitoring mechanism.
(2) Load balancing. The cooperative monitoring mechanism distributes the task of monitoring servers across the data servers themselves. The monitoring server receives information only when a data server joins the system for the first time, rejoins it, or fails. The large volume of normal-state server information that the monitoring server originally had to process is instead received and processed by the individual data servers, which effectively achieves load balancing.
(3) Update overhead. When a data server joins the system or fails, the network topology of the monitoring system changes, and information on some servers must be updated to rebuild it. However, each time a server joins or fails, only the monitoring server and the two data servers adjacent to that server are affected, and in essence only the data server responsible for monitoring the newly joined or failed server is affected. Only two records of the monitoring routing table need to be modified; the other records are unaffected, so the update overhead is very low.
(4) Detection efficiency. Whether data servers fail in a discrete manner or as a group, the cooperative monitoring mechanism effectively detects all failed servers. Detecting all failed servers takes longer when a group of data servers fails, but because the probability of a group failure is very low, the impact on overall system performance is limited.
Brief description of the drawings
Fig. 1 is the cooperative monitoring network topology diagram.
Fig. 2 is the server cooperative monitoring model diagram.
Detailed description of the embodiments
Even in cloud data centers built from low-cost servers, the proportion of servers that fail within a given period remains low, so reporting "abnormal" states clearly requires fewer communications than reporting "normal" states. The server cooperative monitoring method proposed by the present invention follows the principle that "a data server that does not actively send information to the monitoring server is assumed to be normal": a data server actively sends information to the monitoring server only when an abnormality occurs. The difficulty is that when a server itself becomes abnormal, it loses the ability to send abnormality information to the monitoring server (master server). The cooperative monitoring mechanism therefore relies on mutual sensing and mutual supervision among data servers, and a server's abnormality is reported to the monitoring server by its neighboring server.
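As a rough illustration of why reporting only abnormal states reduces the load on the monitoring server, the sketch below counts the messages the monitoring server would have to process under each scheme. The function names, the sample figures and the simplification that each failure produces exactly one "fault" message are assumptions made for illustration, not values from the patent.

```python
def centralized_messages(num_servers: int, cycles: int) -> int:
    """Centralized scheme: every server sends one heartbeat per cycle to the monitoring server."""
    return num_servers * cycles


def cooperative_messages(failures: int) -> int:
    """Cooperative scheme: only "fault" reports reach the monitoring server (one per failure, assumed)."""
    return failures


# Hypothetical figures: 10,000 servers observed for 100 heartbeat cycles, 5 failures.
print(centralized_messages(10_000, 100))  # 1000000
print(cooperative_messages(5))            # 5
```

Heartbeats still flow in the cooperative scheme, but they go to the heartbeat queues read by neighboring data servers rather than to the monitoring server.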
1. Unidirectional ring topology
The server cluster in the cloud computing system first forms a unidirectional ring topology for cooperative monitoring: the servers are joined end to end, and monitoring between servers is unidirectional. As shown in Fig. 1, the predecessor server (PreNode) of DN_1 is DN_8 and the successor server (PostNode) of DN_1 is DN_2. DN_1 is supervised only by DN_2 and is responsible only for supervising the state of DN_8. If DN_1 breaks down and fails, DN_2 is responsible for reporting the failure of DN_1 to the monitoring server MN.
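The predecessor and successor relations of the ring in Fig. 1 can be illustrated with a few lines of code. This is a minimal sketch: the function name `ring_neighbors` and the list representation of the ring are assumptions for illustration only.

```python
def ring_neighbors(nodes, node):
    """Return (PreNode, PostNode) of `node` in a unidirectional ring."""
    i = nodes.index(node)
    pre = nodes[i - 1]                  # index -1 wraps to the last server
    post = nodes[(i + 1) % len(nodes)]  # wraps back to the first server
    return pre, post


ring = [f"DN{k}" for k in range(1, 9)]  # DN1 .. DN8, as in Fig. 1
pre, post = ring_neighbors(ring, "DN1")
print(pre, post)  # DN8 DN2
```

Here DN1 monitors its PreNode (DN8) and is monitored by its PostNode (DN2).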
The server cooperative monitoring method for large-scale cloud data centers proposed by the present invention involves the following aspects:
(1) A server newly joining the system establishes predecessor and successor relations with other servers: it determines which server it is responsible for monitoring and which server monitors it, and obtains information such as the IP addresses of its predecessor and successor servers. For example, DN_1 knows that its predecessor server is DN_8 and its successor server is DN_2, and not any other server.
(2) When a server breaks down and fails, the original network topology is broken, and the method rebuilds a complete ring topology quickly and at low overhead. For example, when DN_1 goes down, DN_2 and DN_8 quickly establish a cooperative monitoring relation, and servers not directly associated with DN_1 are unaffected.
(3) When a group of servers fails almost simultaneously, that is, when a server and the server responsible for monitoring it have both failed, the monitoring server quickly obtains the failure information of all of them, and the badly damaged network topology is quickly rebuilt. For example, when DN_1 and DN_2 fail at the same time, the abnormality of both DN_1 and DN_2 is quickly reported to the monitoring server.
2. Cooperative monitoring model
To address the above, a server cooperative monitoring model suited to large-scale cloud computing systems must first be built. As shown in Fig. 2, it mainly involves the following functional components:
(1) Monitoring server. The monitoring server is responsible for monitoring the running state of all servers in the system. The server state information it obtains can be provided to the cloud system administrator for a quick overview of the system; it maintains the cooperative monitoring relations between servers to ensure the normal operation of the system; and it provides this information to other modules in the system (such as the task scheduler) as a basis for selecting task executors or service providers. In particular, information about failed servers is sent promptly to modules such as the task scheduler.
(2) Message router. The message router serves as the core component for information exchange among all servers and modules in the system. It adopts the Advanced Message Queuing Protocol (AMQP) and can maintain one or more message queues.
(3) Message queue. Each message queue is a named buffer on the message router with a unique queue identifier QID, and each message has a topic (Topic). A server can be bound to a specific message queue. The cooperative monitoring system involves two kinds of message queue: heartbeat queues and the global monitoring queue. Each server has one heartbeat queue, to which it periodically publishes heartbeat messages and from which other servers can subscribe to messages. There is only one global monitoring queue in the whole system, and it carries message packets with two topics: "login" and "fault". A "login" message packet contains the identifier NID, the IP address and the heartbeat queue identifier QID of the sending server; a "fault" message packet contains the identifier NID of the sending server.
(4) Monitoring routing table. The monitoring routing table resides on the monitoring server and is in essence a circular singly linked list. Each record in the table is a 5-tuple (NID, IP, QID, PreNode, PreQID), where NID is the server identifier, IP is the server's IP address, QID is the identifier of the server's heartbeat queue, PreNode is the predecessor server of the current server, and PreQID is the identifier of the predecessor server's heartbeat queue. A typical monitoring routing table is shown in Table 1.
Table 1: Monitoring routing table
NID         IP                QID         PreNode     PreQID
DN_1        198.1.1.1         Q_1         DN_m        Q_m
DN_2        198.1.1.2         Q_2         DN_1        Q_1
...         ...               ...         ...         ...
DN_{i-1}    198.1.1.(i-1)     Q_{i-1}     DN_{i-2}    Q_{i-2}
DN_i        198.1.1.i         Q_i         DN_{i-1}    Q_{i-1}
DN_{i+1}    198.1.1.(i+1)     Q_{i+1}     DN_i        Q_i
...         ...               ...         ...         ...
DN_m        198.1.1.m         Q_m         DN_{m-1}    Q_{m-1}
(5) Daemon process. A daemon process resides on each server and interacts with other components on the server's behalf. In the monitoring system, it is mainly responsible for publishing data to specific message queues and subscribing to the required messages from specific message queues.
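The components above can be modeled with a few small data structures. The sketch below is an in-memory stand-in, not a real message broker: the class names `Record` and `MessageRouter`, the queue identifier `Q_global` and the single-consumer `fetch` method are all assumptions for illustration. In the scheme described here each heartbeat queue is read by one successor server and the global monitoring queue by the monitoring server, so one consumer per queue is enough for the sketch.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass
class Record:
    """One row of the monitoring routing table: (NID, IP, QID, PreNode, PreQID)."""
    nid: str
    ip: str
    qid: str
    pre_node: str
    pre_qid: str


class MessageRouter:
    """In-memory stand-in for the message router: named queues keyed by QID."""

    def __init__(self):
        self.queues = defaultdict(deque)

    def publish(self, qid, topic, body):
        """Append a (topic, body) message to the queue identified by qid."""
        self.queues[qid].append((topic, body))

    def fetch(self, qid):
        """Pop the oldest message from a queue, or return None if it is empty."""
        queue = self.queues[qid]
        return queue.popleft() if queue else None


GLOBAL_QID = "Q_global"  # hypothetical identifier of the global monitoring queue

router = MessageRouter()
# A daemon process publishes a heartbeat and a "login" packet for its server.
router.publish("Q1", "heartbeat", {"nid": "DN1"})
router.publish(GLOBAL_QID, "login", {"nid": "DN1", "ip": "198.1.1.1", "qid": "Q1"})

topic, body = router.fetch(GLOBAL_QID)
row = Record(body["nid"], body["ip"], body["qid"], pre_node="DN1", pre_qid="Q1")
print(topic, row.nid, row.qid)  # login DN1 Q1
```

With a single registered server, the record points at itself as predecessor purely to keep the example small; the patent text does not describe that edge case.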
3. Cooperative monitoring flow when a data server joins the system
The cooperative monitoring flow when a data server joins the system is as follows:
Step 1: The system first determines whether the data server is joining the system for the first time or rejoining it. If it is joining for the first time, it connects to the message router through its daemon process and requests the message router to create an independent heartbeat queue for it; the server then periodically publishes heartbeat messages to this queue. If N servers have ever joined the system, there are N heartbeat queues on the message router, regardless of whether those servers are currently online. If the data server is rejoining the system, the message router already holds the heartbeat queue created when the server first joined, so the queue need not be created again, and the server periodically publishes heartbeat messages to its own heartbeat queue;
Step 2: The data server actively reports to the monitoring server by publishing a message packet with the topic "login" to the message router; this packet is received into the global monitoring queue on the message router;
Step 3: The monitoring server subscribed to the global monitoring queue during initialization. When it obtains a "login" message packet, it immediately extracts the information of the server to be monitored (NID, IP, QID), inserts it into the locally maintained monitoring routing table, modifies the record of that server's successor server in the monitoring routing table, and sends the relevant information to the corresponding servers in order to rebuild the unidirectional ring topology.
The specific modification method is as follows:
Step (1): Suppose data server DN_{m+1} joins the system. Using the information in its "login" message packet, the monitoring server appends a new record for DN_{m+1} at the end of the monitoring routing table and modifies the corresponding information in the record of DN_1, as shown in Table 2. Adding DN_{m+1} means that DN_{m+1} is inserted between DN_1 and DN_m. The original predecessor pointer (PreNode, PreQID) of DN_1 (that is, the server identifier of DN_m and the corresponding heartbeat queue identifier Q_m; the identifier of server DN_m is written simply as DN_m, and the identifiers of other servers are written in the same way below) is therefore written into the predecessor pointer fields of the DN_{m+1} record, and the predecessor pointer of DN_1 is then changed to (DN_{m+1}, Q_{m+1}).
Table 2: Monitoring routing table after the DN_{m+1} record is added
NID         IP                QID         PreNode     PreQID
DN_1        198.1.1.1         Q_1         DN_{m+1}    Q_{m+1}
DN_2        198.1.1.2         Q_2         DN_1        Q_1
...         ...               ...         ...         ...
DN_{i-1}    198.1.1.(i-1)     Q_{i-1}     DN_{i-2}    Q_{i-2}
DN_i        198.1.1.i         Q_i         DN_{i-1}    Q_{i-1}
DN_{i+1}    198.1.1.(i+1)     Q_{i+1}     DN_i        Q_i
...         ...               ...         ...         ...
DN_m        198.1.1.m         Q_m         DN_{m-1}    Q_{m-1}
DN_{m+1}    198.1.1.(m+1)     Q_{m+1}     DN_m        Q_m
Step (2): The monitoring server sends the predecessor pointer (DN_m, Q_m) in the DN_{m+1} record to server DN_{m+1}, using the IP address of DN_{m+1}. It also sends the modified predecessor pointer (DN_{m+1}, Q_{m+1}) in the DN_1 record to server DN_1, using the IP address of DN_1;
Step (3): Server DN_{m+1} requests from the message router a subscription to the heartbeat queue identified by Q_m, and server DN_1 requests a subscription to the heartbeat queue identified by Q_{m+1}. The network topology is now rebuilt: the original monitoring relation between DN_1 and DN_m is replaced by a monitoring relation between DN_1 and DN_{m+1} and one between DN_{m+1} and DN_m.
Step 4: The monitoring server also notifies the task scheduler that DN_{m+1} has joined the system, so that when new tasks arrive the task scheduler can choose to assign them to DN_{m+1}.
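The routing table update for a joining server can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `handle_login`, the dictionary layout of the table and the returned list of notices are hypothetical, the ring is assumed to contain at least one server already, and a three-server ring stands in for DN_1 to DN_m.

```python
def handle_login(table, order, nid, ip, qid):
    """Apply a "login" packet to the monitoring routing table.

    table: dict mapping NID -> record dict; order: list of NIDs in table order.
    Returns the (target NID, (PreNode, PreQID)) notices the monitoring server sends.
    """
    head = order[0]  # DN_1 in the text
    old_pre = (table[head]["PreNode"], table[head]["PreQID"])  # (DN_m, Q_m)
    # Step (1): append the new record, inheriting DN_1's old predecessor pointer.
    table[nid] = {"IP": ip, "QID": qid, "PreNode": old_pre[0], "PreQID": old_pre[1]}
    order.append(nid)
    # DN_1 now monitors the newly joined server.
    table[head]["PreNode"], table[head]["PreQID"] = nid, qid
    # Step (2): one notice for the new server and one for DN_1.
    return [(nid, old_pre), (head, (nid, qid))]


order = ["DN1", "DN2", "DN3"]
table = {
    "DN1": {"IP": "198.1.1.1", "QID": "Q1", "PreNode": "DN3", "PreQID": "Q3"},
    "DN2": {"IP": "198.1.1.2", "QID": "Q2", "PreNode": "DN1", "PreQID": "Q1"},
    "DN3": {"IP": "198.1.1.3", "QID": "Q3", "PreNode": "DN2", "PreQID": "Q2"},
}
notices = handle_login(table, order, "DN4", "198.1.1.4", "Q4")
print(notices)  # [('DN4', ('DN3', 'Q3')), ('DN1', ('DN4', 'Q4'))]
```

Only two records are touched (the new one and that of DN1), which matches the low update overhead claimed for the method.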
4. Cooperative monitoring flow when a single data server fails
When a single server or a few scattered servers fail (with server DN_{i+1} detecting the failure of its predecessor DN_i as the example), the cooperative monitoring flow is as follows:
Step 1: If server DN_{i+1} obtains no message from the heartbeat queue identified as Q_i for several consecutive heartbeat cycles (for example, 3 consecutive cycles), it immediately publishes a message packet with the subject "fault" to the message router; this packet is received by the global monitoring queue on the message router.
Step 2: When the monitoring server obtains the "fault" message packet from the global monitoring queue, it extracts from the packet that the server reporting the fault is DN_{i+1}, and then looks up its local monitoring routing table to find that the object monitored by DN_{i+1} (its predecessor server) is DN_i, as shown in Table 3.
Table 3: Monitoring routing table after DN_i fails
Step 3: The monitoring server determines that DN_i has failed and then updates the monitoring routing table: it first extracts the (PreNode, PreQID) information in the record for DN_i (that is, the server identifier of DN_{i-1} and the corresponding heartbeat queue identifier Q_{i-1}) and uses it to update the (PreNode, PreQID) information in the record for DN_{i+1}; it then deletes the record for DN_i.
Step 4: Using the IP address of DN_{i+1}, the monitoring server sends the updated (PreNode, PreQID) information to server DN_{i+1}, which applies to the message router to subscribe to the heartbeat queue identified as Q_{i-1}. DN_{i+1} and DN_{i-1} have now established a monitoring relation, and the unidirectional ring network topology is rebuilt.
Step 5: The monitoring server also notifies the task dispatcher that DN_i has failed. When new tasks arrive later, the task dispatcher will not assign them to DN_i unless DN_i returns to normal and rejoins the system.
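Steps 2 to 4 of this flow amount to removing one node from a singly linked ring. The sketch below illustrates that update on the monitoring server side; the function name `handle_fault`, the dictionary layout and the four-server ring are assumptions made for the example, and the transport through the message router is omitted.

```python
def handle_fault(table, reporter):
    """Repair the ring after `reporter` publishes a "fault" message.

    Returns the failed server and the (IP, queue) re-subscription notice
    that the monitoring server would send to the reporter.
    """
    # Step 2: the failed server is the predecessor of the reporter.
    failed = table[reporter]["pre_node"]
    # Step 3: copy the failed server's predecessor pointer to the reporter,
    # then delete the failed server's record.
    failed_rec = table.pop(failed)
    table[reporter]["pre_node"] = failed_rec["pre_node"]
    table[reporter]["pre_qid"] = failed_rec["pre_qid"]
    # Step 4: the reporter must now subscribe to the new predecessor's queue.
    resubscribe = (table[reporter]["ip"], failed_rec["pre_qid"])
    return failed, resubscribe


# Hypothetical ring of four servers; each server monitors its predecessor.
table = {
    "DN1": {"ip": "198.1.1.1", "qid": "Q1", "pre_node": "DN4", "pre_qid": "Q4"},
    "DN2": {"ip": "198.1.1.2", "qid": "Q2", "pre_node": "DN1", "pre_qid": "Q1"},
    "DN3": {"ip": "198.1.1.3", "qid": "Q3", "pre_node": "DN2", "pre_qid": "Q2"},
    "DN4": {"ip": "198.1.1.4", "qid": "Q4", "pre_node": "DN3", "pre_qid": "Q3"},
}
failed, resubscribe = handle_fault(table, "DN3")
```

Here DN3 plays the role of DN_{i+1} and DN2 the role of DN_i. The notice to the task dispatcher in step 5 would be issued with the returned `failed` identifier.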
5. Cooperative monitoring flow when a contiguous group of data servers fails
In current cloud data center environments the probability that a contiguous group of servers fails is very low, but it can still happen and therefore must be handled. As shown in Table 4, suppose servers DN_i and DN_{i-1} fail almost simultaneously. This means that DN_i, which is responsible for monitoring DN_{i-1}, cannot inform the monitoring server of the failure of DN_{i-1} by publishing a "fault" message packet to the message router. DN_{i+1}, however, has not failed and is therefore still able to monitor DN_i.
Table 4: Monitoring routing table when servers DN_i and DN_{i-1} have failed consecutively
NID        IP              QID       PreNode    PreQID
DN_1       198.1.1.1       Q_1       DN_{m+1}   Q_{m+1}
DN_2       198.1.1.2       Q_2       DN_1       Q_1
……         ……              ……        ……         ……
DN_{i-2}   198.1.1.(i-2)   Q_{i-2}   DN_{i-3}   Q_{i-3}
DN_{i-1}   198.1.1.(i-1)   Q_{i-1}   DN_{i-2}   Q_{i-2}
DN_i       198.1.1.i       Q_i       DN_{i-1}   Q_{i-1}
DN_{i+1}   198.1.1.(i+1)   Q_{i+1}   DN_i       Q_i
……         ……              ……        ……         ……
DN_m       198.1.1.m       Q_m       DN_{m-1}   Q_{m-1}
DN_{m+1}   198.1.1.(m+1)   Q_{m+1}   DN_m       Q_m
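The detection rule applied by a data server (no heartbeat from the monitored queue for several consecutive cycles) can be sketched as below. This is only an illustration under stated assumptions: the class name `HeartbeatWatcher` is hypothetical, a standard-library `queue.Queue` stands in for the subscribed heartbeat queue on the message router, and the threshold of 3 follows the example value given in the text.

```python
import queue


class HeartbeatWatcher:
    """Counts consecutive heartbeat cycles in which no message arrived."""

    def __init__(self, node_id, heartbeat_queue, threshold=3):
        self.node_id = node_id
        self.heartbeat_queue = heartbeat_queue
        self.threshold = threshold
        self.missed = 0

    def tick(self):
        """Run once per heartbeat cycle; return a "fault" message or None."""
        got_message = False
        # Drain everything that arrived during this cycle.
        while True:
            try:
                self.heartbeat_queue.get_nowait()
                got_message = True
            except queue.Empty:
                break
        if got_message:
            self.missed = 0
            return None
        self.missed += 1
        if self.missed >= self.threshold:
            # Reset so that a new predecessor gets a full grace period.
            self.missed = 0
            return {"subject": "fault", "sender": self.node_id}
        return None


q_i = queue.Queue()                  # stand-in for heartbeat queue Q_i
watcher = HeartbeatWatcher("DN_i+1", q_i)
q_i.put("heartbeat")                 # one heartbeat, then silence
results = [watcher.tick() for _ in range(4)]
```

In this run the first cycle sees a heartbeat, the next two cycles are counted as misses, and the fourth cycle produces the "fault" message that would be published to the global monitoring queue.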
The cooperative monitoring flow when a contiguous group of servers fails is as follows:
Step 1: When server DN_{i+1} obtains no message from the heartbeat queue identified as Q_i for several consecutive cycles, it immediately publishes a message packet with the subject "fault" to the message router; this packet is received by the global monitoring queue on the message router.
Step 2: When the monitoring server obtains the "fault" message packet from the global monitoring queue, it extracts from the packet that the server reporting the fault is DN_{i+1}, then looks up its local monitoring routing table to find that the object monitored by DN_{i+1} is DN_i, and determines that DN_i has failed.
Step 3: The monitoring server updates the monitoring routing table: it first extracts the (PreNode, PreQID) information in the record for DN_i (that is, the server identifier of DN_{i-1} and the corresponding heartbeat queue identifier Q_{i-1}) and uses it to update the (PreNode, PreQID) information in the record for DN_{i+1}; it then deletes the record for DN_i.
Step 4: Using the IP address of DN_{i+1}, the monitoring server sends the updated (PreNode, PreQID) information to server DN_{i+1}, which applies to the message router to subscribe to the heartbeat queue identified as Q_{i-1}. DN_{i+1} and DN_{i-1} establish a monitoring relation, as shown in Table 5.
Table 5: Monitoring routing table after the record of failed server DN_i is deleted
Step 5: The monitoring server also notifies the task dispatcher that DN_i has failed. When new tasks arrive later, the task dispatcher will not assign them to DN_i unless DN_i returns to normal and rejoins the system.
Step 6: Because DN_{i-1} has also failed, server DN_{i+1} likewise obtains no message from the heartbeat queue identified as Q_{i-1} for several consecutive cycles, and it again immediately publishes a message packet with the subject "fault" to the message router; this packet is received by the global monitoring queue on the message router.
Step 7: When the monitoring server obtains the "fault" message packet from the global monitoring queue, it extracts from the packet that the server reporting the fault is DN_{i+1}, then looks up its local monitoring routing table to find that the object monitored by DN_{i+1} is now DN_{i-1}, and determines that DN_{i-1} has failed.
Step 8: The monitoring server updates the monitoring routing table: it first extracts the (PreNode, PreQID) information in the record for DN_{i-1} (that is, the server identifier of DN_{i-2} and the corresponding heartbeat queue identifier Q_{i-2}) and uses it to update the (PreNode, PreQID) information in the record for DN_{i+1}; it then deletes the record for DN_{i-1}.
Step 9: Using the IP address of DN_{i+1}, the monitoring server sends the updated (PreNode, PreQID) information to server DN_{i+1}, which applies to the message router to subscribe to the heartbeat queue identified as Q_{i-2}. DN_{i+1} and DN_{i-2} establish a monitoring relation, and the unidirectional ring network topology is rebuilt once more, as shown in Table 6.
Table 6: Monitoring routing table after the record of failed server DN_{i-1} is deleted
Step 10: The monitoring server also notifies the task dispatcher that DN_{i-1} has failed. When new tasks arrive later, the task dispatcher will not assign them to DN_{i-1} unless DN_{i-1} returns to normal and rejoins the system.
When a larger contiguous group of data servers fails, the same flow as steps 6 to 9 is repeated to find all failed data servers in turn.
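The repeated application of this flow can be sketched as a loop in which the first surviving successor keeps reporting its current predecessor until it reaches a live server. The sketch is a simplified simulation, not the patented implementation: `make_ring`, `detect_contiguous_failures` and the `is_alive` callback are hypothetical, and `is_alive` stands in for the missed-heartbeat check that would take several cycles per failed server in a real deployment.

```python
def make_ring(n):
    """Build a routing table for servers DN1..DNn, each monitoring its predecessor."""
    table = {}
    for k in range(1, n + 1):
        pre = (k - 2) % n + 1          # predecessor index, wrapping DN1 -> DNn
        table[f"DN{k}"] = {"qid": f"Q{k}", "pre_node": f"DN{pre}", "pre_qid": f"Q{pre}"}
    return table


def detect_contiguous_failures(table, reporter, is_alive):
    """Remove every failed server directly upstream of `reporter`, in order."""
    detected = []
    while True:
        pred = table[reporter]["pre_node"]
        if pred == reporter or is_alive(pred):
            break                       # ring is closed over live servers again
        rec = table.pop(pred)           # delete the failed server's record
        table[reporter]["pre_node"] = rec["pre_node"]
        table[reporter]["pre_qid"] = rec["pre_qid"]
        detected.append(pred)           # each entry would also go to the task dispatcher
    return detected


table = make_ring(6)
failed = {"DN3", "DN4"}                 # a contiguous failed group
detected = detect_contiguous_failures(table, "DN5", lambda nid: nid not in failed)
```

In this example DN5 plays the role of DN_{i+1}: it first exposes DN4, then DN3, and ends up monitoring DN2, so the failed servers are found one at a time in the order the ring is walked.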

Claims (5)

1. A server cooperative monitoring method for a large-scale cloud data center, implemented on a server cooperative monitoring model whose main components comprise a monitoring server, a message router, data servers, message queues, a monitoring routing table and daemon processes, the cooperative monitoring method comprising the following steps:
Step 1: all data servers are connected in sequence to form a unidirectional ring topology; each data server has a predecessor server and a successor server and is monitored by its successor server; when a data server fails, its successor server is responsible for reporting the failure of that data server to the monitoring server;
Step 2: when a data server joins the system, the server cooperative monitoring method is: the unidirectional ring topology is re-established to include the new data server, and the monitoring server notifies the task dispatcher that the new data server has joined the system;
when a single data server fails, the server cooperative monitoring method is: the successor server of that data server is responsible for detecting the failure and reporting it to the monitoring server; the unidirectional ring topology is re-established excluding the failed data server; the monitoring server notifies the task dispatcher of the failure of that data server, and monitoring continues;
when a contiguous group of data servers fails, the server cooperative monitoring method is: the first normal data server following the group of failed data servers is responsible for detecting the failed data servers one by one and reporting each failure in turn to the monitoring server; the unidirectional ring topology is re-established excluding the failed data servers; the monitoring server notifies the task dispatcher of each data server failure in turn, and monitoring continues.
2. The server cooperative monitoring method for a large-scale cloud data center according to claim 1, wherein in step 2, when a data server joins the system, the server cooperative monitoring method is as follows:
Step 1: the system first determines whether the data server currently joining is joining for the first time or rejoining; if the data server is joining for the first time, it connects to the message router through its daemon process and requests the message router to create an independent heartbeat queue for it, and the server then periodically publishes heartbeat messages to this heartbeat queue; if N data servers have ever joined the system, N heartbeat queues exist on the message router regardless of whether those servers are currently online, and in subsequent stages each server periodically publishes messages to its own heartbeat queue; if the data server is rejoining, the message router already holds the heartbeat queue created for it when it first joined, so no new queue needs to be created, and the server periodically publishes heartbeat messages to its own heartbeat queue;
Step 2: the data server actively reports to the monitoring server; specifically, it publishes a message packet with the subject "login" to the message router, and this packet is inserted into the global monitoring queue on the message router;
Step 3: the monitoring server subscribed to this global monitoring queue at initialization; when the monitoring server obtains the "login" message packet, it immediately extracts from the packet the information (NID, IP, QID) of the server to be monitored, inserts this information into the monitoring routing table that it maintains locally, modifies the record of this server's successor server in the monitoring routing table, and sends the relevant information to that successor server so as to rebuild the unidirectional ring network topology;
Step 4: the monitoring server also notifies the task dispatcher that DN_{m+1} has joined the system, so that when new tasks arrive later the task dispatcher can choose to assign them to DN_{m+1}.
3. The server cooperative monitoring method for a large-scale cloud data center according to claim 2, wherein the modification in step 3 is specifically:
Step (1): suppose data server DN_{m+1} joins the system; according to the information in its "login" message packet, the monitoring server appends a new record for DN_{m+1} at the end of the monitoring routing table and at the same time modifies the corresponding information in the record for DN_1, as shown in Table 2; adding data server DN_{m+1} means that DN_{m+1} is inserted between DN_1 and DN_m, so the original predecessor pointer (PreNode, PreQID) of DN_1, namely (DN_m, Q_m), is written into the predecessor pointer fields of the record for DN_{m+1}, and the predecessor pointer of DN_1 is then changed to (DN_{m+1}, Q_{m+1});
Step (2): using the IP address of DN_{m+1}, the monitoring server sends the predecessor pointer (DN_m, Q_m) in the record for DN_{m+1} to server DN_{m+1}; using the IP address of DN_1, it also sends the modified predecessor pointer (DN_{m+1}, Q_{m+1}) in the record for DN_1 to server DN_1;
Step (3): server DN_{m+1} applies to the message router to subscribe to the heartbeat queue identified as Q_m, and server DN_1 applies to subscribe to the heartbeat queue identified as Q_{m+1}; the network topology is thereby rebuilt, and the original monitoring relation between DN_1 and DN_m is replaced by a monitoring relation between DN_1 and DN_{m+1} and a monitoring relation between DN_{m+1} and DN_m.
4. The server cooperative monitoring method for a large-scale cloud data center according to claim 1, wherein in step 2, when a single data server fails, the server cooperative monitoring method is as follows:
Step 1: if server DN_{i+1} obtains no message from the heartbeat queue identified as Q_i for several consecutive heartbeat cycles, it immediately publishes a message packet with the subject "fault" to the message router, and this packet is received by the global monitoring queue on the message router;
Step 2: when the monitoring server obtains the "fault" message packet from the global monitoring queue, it extracts from the packet that the server reporting the fault is DN_{i+1}, and then looks up its local monitoring routing table to find that the predecessor server monitored by DN_{i+1} is DN_i;
Step 3: the monitoring server determines that DN_i has failed and then updates the monitoring routing table: it first extracts the (PreNode, PreQID) information (DN_{i-1}, Q_{i-1}) in the record for DN_i and uses it to update the (PreNode, PreQID) information in the record for DN_{i+1}, and then deletes the record for DN_i;
Step 4: using the IP address of DN_{i+1}, the monitoring server sends the updated (PreNode, PreQID) information to server DN_{i+1}, which applies to the message router to subscribe to the heartbeat queue identified as Q_{i-1}; DN_{i+1} and DN_{i-1} have now established a monitoring relation, and the unidirectional ring network topology is rebuilt;
Step 5: the monitoring server also notifies the task dispatcher that DN_i has failed; when new tasks arrive later, the task dispatcher will not assign them to DN_i unless DN_i returns to normal and rejoins the system.
5. The server cooperative monitoring method for a large-scale cloud data center according to claim 1, wherein in step 2, when a contiguous group of data servers fails, the server cooperative monitoring method is as follows:
Step 1: when server DN_{i+1} obtains no message from the heartbeat queue identified as Q_i for several consecutive cycles, it immediately publishes a message packet with the subject "fault" to the message router, and this packet is received by the global monitoring queue on the message router;
Step 2: when the monitoring server obtains the "fault" message packet from the global monitoring queue, it extracts from the packet that the server reporting the fault is DN_{i+1}, then looks up its local monitoring routing table to find that the object monitored by DN_{i+1} is DN_i, and determines that DN_i has failed;
Step 3: the monitoring server updates the monitoring routing table: it first extracts the (PreNode, PreQID) information (DN_{i-1}, Q_{i-1}) in the record for DN_i and uses it to update the (PreNode, PreQID) information in the record for DN_{i+1}, and then deletes the record for DN_i;
Step 4: using the IP address of DN_{i+1}, the monitoring server sends the updated (PreNode, PreQID) information to server DN_{i+1}, which applies to the message router to subscribe to the heartbeat queue identified as Q_{i-1}; DN_{i+1} and DN_{i-1} establish a monitoring relation;
Step 5: the monitoring server also notifies the task dispatcher that DN_i has failed; when new tasks arrive later, the task dispatcher will not assign them to DN_i unless DN_i returns to normal and rejoins the system;
Step 6: because DN_{i-1} has also failed, when server DN_{i+1} likewise obtains no message from the heartbeat queue identified as Q_{i-1} for several consecutive cycles, it again immediately publishes a message packet with the subject "fault" to the message router, and this packet is received by the global monitoring queue on the message router;
Step 7: when the monitoring server obtains the "fault" message packet from the global monitoring queue, it extracts from the packet that the server reporting the fault is DN_{i+1}, then looks up its local monitoring routing table to find that the object monitored by DN_{i+1} is DN_{i-1}, and determines that DN_{i-1} has failed;
Step 8: the monitoring server updates the monitoring routing table: it first extracts the (PreNode, PreQID) information (DN_{i-2}, Q_{i-2}) in the record for DN_{i-1} and uses it to update the (PreNode, PreQID) information in the record for DN_{i+1}, and then deletes the record for DN_{i-1};
Step 9: using the IP address of DN_{i+1}, the monitoring server sends the updated (PreNode, PreQID) information to server DN_{i+1}, which applies to the message router to subscribe to the heartbeat queue identified as Q_{i-2}; DN_{i+1} and DN_{i-2} establish a monitoring relation, and the unidirectional ring network topology is rebuilt once more;
Step 10: the monitoring server also notifies the task dispatcher that DN_{i-1} has failed; when new tasks arrive later, the task dispatcher will not assign them to DN_{i-1} unless DN_{i-1} returns to normal and rejoins the system.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410166275.2A CN103944784B (en) 2014-04-23 2014-04-23 A kind of server cooperative monitoring method towards large-scale cloud data center

Publications (2)

Publication Number Publication Date
CN103944784A true CN103944784A (en) 2014-07-23
CN103944784B CN103944784B (en) 2019-03-05

Family

ID=51192277

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410166275.2A Active CN103944784B (en) 2014-04-23 2014-04-23 A kind of server cooperative monitoring method towards large-scale cloud data center

Country Status (1)

Country Link
CN (1) CN103944784B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120198055A1 (en) * 2011-01-28 2012-08-02 Oracle International Corporation System and method for use with a data grid cluster to support death detection
CN103731289A (en) * 2012-10-16 2014-04-16 无锡云捷科技有限公司 Method for automatic expansion of network server
CN102932210A (en) * 2012-11-23 2013-02-13 北京搜狐新媒体信息技术有限公司 Method and system for monitoring node in PaaS cloud platform
CN103152438A (en) * 2013-04-09 2013-06-12 上海理想信息产业(集团)有限公司 Method for obtaining business health degree under cloud computing environment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
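Xu Xiaolong et al.: "A distributed cooperative malicious-process identification mechanism based on multiple mobile agents", Application Research of Computers (计算机应用研究) *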

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107924411A (en) * 2015-08-14 2018-04-17 甲骨文国际公司 The recovery of UI states in transaction system
CN107924411B (en) * 2015-08-14 2023-04-21 甲骨文国际公司 Method and system for recovering UI state in transaction system
CN107769953A (en) * 2016-08-23 2018-03-06 佛山市顺德区顺达电脑厂有限公司 Server failure detecting system
CN107315660A (en) * 2017-06-29 2017-11-03 郑州云海信息技术有限公司 A kind of two-node cluster hot backup method of virtualization system, apparatus and system
CN109450988A (en) * 2018-10-19 2019-03-08 焦点科技股份有限公司 A method of data consistency is ensured under framework living more than strange land
CN109450988B (en) * 2018-10-19 2020-07-31 焦点科技股份有限公司 Method for guaranteeing data consistency under remote multi-active architecture
CN110535939A (en) * 2019-08-29 2019-12-03 深圳前海环融联易信息科技服务有限公司 A kind of service discovery and method for pre-emptively, device, computer equipment and storage medium
CN110535939B (en) * 2019-08-29 2022-02-11 深圳前海环融联易信息科技服务有限公司 Service discovery and preemption method, device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN103944784B (en) 2019-03-05

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: No. 66, New Model Road, Gulou District, Nanjing City, Jiangsu Province, 210000

Applicant after: Nanjing Post & Telecommunication Univ.

Address before: 210046 9 Wen Yuan Road, Ya Dong new town, Qixia District, Nanjing, Jiangsu.

Applicant before: Nanjing Post & Telecommunication Univ.

GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20191223

Address after: 264200 no.17-2, Xinwei Road, Huancui District, Weihai City, Shandong Province

Patentee after: Weihai Blue Ocean Bank Co., Ltd

Address before: 210000, 66 new model street, Gulou District, Jiangsu, Nanjing

Patentee before: Nanjing Post & Telecommunication Univ.