CN105812159A - Cloud platform monitoring alarm device - Google Patents

Info

Publication number
CN105812159A
Authority
CN
China
Prior art keywords
node
monitoring alarm
master node
task
new
Prior art date
Legal status
Granted
Application number
CN201410841470.0A
Other languages
Chinese (zh)
Other versions
CN105812159B (en)
Inventor
刘冬
喻之斌
贝振东
须成忠
Current Assignee
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date
2014-12-30
Filing date
2014-12-30
Publication date
2016-07-27
Application filed by Shenzhen Institute of Advanced Technology of CAS
Priority to CN201410841470.0A (CN105812159B)
Publication of CN105812159A
Application granted
Publication of CN105812159B
Legal status: Active

Abstract

The invention relates to a cloud platform monitoring and alarm method. Each node broadcasts its own running state, receives the running states broadcast by the other nodes, and periodically updates its locally maintained list of the monitoring and alarm service running states of all nodes; a master node is determined according to this list; the node determined to be the master node performs the duties of the master node; and a monitoring and alarm task message queue is maintained through a message queue protocol. The scheme is simple and reliable to implement and offers good scalability, high availability and fault tolerance.

Description

A cloud platform monitoring and alarm method
Technical field
The present invention relates to a cloud platform monitoring and alarm method.
Background
The maturing of virtualization technology, combined with grid computing, gave rise to cloud computing platforms. A cloud computing platform pools and shares a huge set of resources (infrastructure, data storage, various platforms and software) that cooperate with one another, abstracts layered services on top of them, and provides these services to users on a pay-per-use basis, for example Infrastructure as a Service (IaaS), Platform as a Service (PaaS) and Software as a Service (SaaS).
Monitoring is an important component of a cloud computing platform. It is a prerequisite for many operations on the platform, such as network analysis, system management, job scheduling, load balancing, event prediction, and fault detection and recovery. It helps the platform quantify resource usage dynamically, detect service shortfalls, discover user usage patterns and support the decisions of the resource scheduling module, and so plays a significant role in improving the platform's quality of service. A cloud computing platform comprises not only the underlying storage, network and computing resources, but also the virtual resources built on them and the cloud platform obtained by abstracting and integrating these resources. The cloud platform is full of large numbers of heterogeneous, dynamic and complex resources in a distributed environment, and monitoring and managing them efficiently and dynamically is what guarantees high-quality service.
Well-known cloud computing platforms in industry each have their own monitoring solution for monitoring and alarming on the platform. However, these solutions usually adopt a single-node model, in which one node is responsible for monitoring the entire cloud computing platform. As the cloud platform keeps expanding, the monitoring load on that node keeps growing, and scalability and fault tolerance are poor, so it is hard to guarantee that the platform's monitoring and alarm tasks are executed efficiently. A cloud platform monitoring and alarm system should therefore not only perform monitoring tasks efficiently but also offer good scalability, high availability and fault tolerance.
Summary of the invention
In view of this, it is necessary to provide a cloud platform monitoring and alarm method.
The present invention provides a cloud platform monitoring and alarm method, characterized in that the method comprises the following steps: a. each node broadcasts its own running state, receives the running states broadcast by the other nodes, and periodically updates its locally maintained monitoring and alarm service status list of all nodes; b. a master node is determined according to the monitoring and alarm service status list; c. the node determined to be the master node performs the duties of the master node; d. a monitoring and alarm task message queue is maintained by means of a message queue protocol.
The method further comprises step e: when the master node fails, a new master node is determined and the tasks are redistributed.
The monitoring and alarm service status list contains the running-state information and time information of all nodes.
Step b specifically comprises: judging, according to the running-state information and time information in the monitoring and alarm service status list, whether the present node is the earliest-started node among the nodes currently running normally; if it is, the node sends a broadcast message notifying all nodes that it will serve as the master node.
Step c specifically comprises: when new monitoring and alarm tasks are created, the master node distributes them evenly among all nodes; when a new slave node joins or a node is removed, the master node redistributes all monitoring and alarm tasks among the nodes.
Distributing newly created tasks evenly among all nodes specifically comprises: the newly created monitoring and alarm tasks are submitted to the monitoring and alarm message queue, and the master node obtains the new task information from the queue; the master node then applies a suitable algorithm to distribute the monitoring and alarm tasks to the nodes and sends the relevant task IDs to each slave node, thereby achieving load balancing.
Redistributing the tasks when a new slave node joins specifically comprises: the master node determines, from the running state and report time of each node in the all-node status list, whether a node has newly joined; if so, the master node accesses the message queue that holds the monitoring and alarm tasks and obtains the new monitoring and alarm tasks; the master node then applies a suitable algorithm to assign tasks to the newly joined node, thereby achieving load balancing.
Redistributing the tasks when a node is removed specifically comprises: the master node determines, from the running state and report time of each node in the all-node status list, whether a node has been removed; if so, the master node accesses the message queue that holds the monitoring and alarm tasks and obtains the tasks for which that node was responsible; the master node then applies a suitable algorithm to assign these tasks to all nodes that are running normally, thereby achieving load balancing.
Step d specifically comprises: all nodes can access the monitoring and alarm task message queue and obtain monitoring and alarm tasks from it; a newly created monitoring and alarm task is appended to the tail of the queue; an executed monitoring and alarm task is deleted from the queue.
Step e specifically comprises: if the other nodes receive no state update from the current master node for longer than a preset period, the running state of the current master node is judged to be faulty; every node checks its locally maintained monitoring and alarm service status list and selects the earliest-started node currently running as the new master node; the new master node reads the monitoring and alarm task message queue, obtains all current monitoring and alarm tasks, and redistributes them among all nodes.
The cloud platform monitoring and alarm method of the present invention adopts a distributed architecture. The scheme is simple and reliable to implement, can efficiently complete large-scale cloud platform monitoring and alarm tasks, and offers good scalability, high availability and good fault tolerance.
Brief description of the drawings
Fig. 1 is a schematic diagram of the running environment of the cloud platform monitoring and alarm method of the present invention;
Fig. 2 is a flow chart of the cloud platform monitoring and alarm method of the present invention;
Fig. 3 is an operational flow chart of a preferred embodiment of step S3 of the present invention when new monitoring and alarm tasks are created;
Fig. 4 is an operational flow chart of a preferred embodiment of step S3 of the present invention when a new slave node joins;
Fig. 5 is an operational flow chart of a preferred embodiment of step S3 of the present invention when a node is removed.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the drawings and specific embodiments.
Fig. 1 is a schematic diagram of the running environment of the cloud platform monitoring and alarm method of the present invention.
The running environment adopts a master-slave architecture comprising one master node and several slave nodes. Each node runs a monitoring and alarm service process, and together these processes complete the monitoring and alarm tasks of the cloud platform. The master node and the slave nodes, and the slave nodes among themselves, communicate through the Remote Procedure Call protocol (RPC) and the Advanced Message Queuing Protocol (AMQP).
Fig. 2 is an operational flow chart of a preferred embodiment of the cloud platform monitoring and alarm method of the present invention.
Step S1: the monitoring and alarm service on each node periodically broadcasts, via RPC, the running state of its node, informing the other nodes whether this node is running normally. At the same time, each node receives the running states broadcast by the other nodes and periodically updates its locally maintained monitoring and alarm service status list of all nodes. Specifically:
The monitoring and alarm service of each node periodically broadcasts the running state of its node to all nodes via RPC. The broadcast running state carries the current time, so that the other nodes know whether this node is running normally at that moment.
The monitoring and alarm service of each node receives, via RPC, the running states broadcast by the other nodes and periodically updates the locally maintained status list of all nodes. The list records the latest running-state information and time information of every node.
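The status list maintained in step S1 can be sketched in Python as follows. This is a minimal illustration, assuming each broadcast carries a node ID, a health flag, the node's service start time and the report time; `StatusReport`, `StatusList` and their fields are hypothetical names, and the RPC transport is left out because the source does not specify the message format. Only the merging of received reports into the local list is shown.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class StatusReport:
    # Hypothetical heartbeat payload; field names are illustrative only.
    node_id: str
    healthy: bool       # is this node's monitoring and alarm service running normally?
    start_time: float   # when the service started (used later for master election)
    report_time: float  # timestamp carried in the broadcast


class StatusList:
    """Locally maintained running-state list: the newest report per node."""

    def __init__(self) -> None:
        self.entries: Dict[str, StatusReport] = {}

    def update(self, report: StatusReport) -> bool:
        """Merge one received broadcast; return True if the list changed."""
        current = self.entries.get(report.node_id)
        # Ignore reports that are not newer than the one already held.
        if current is not None and report.report_time <= current.report_time:
            return False
        self.entries[report.node_id] = report
        return True


# Usage: a delayed, older heartbeat does not overwrite a newer one.
status = StatusList()
status.update(StatusReport("node-a", True, 100.0, 110.0))
status.update(StatusReport("node-a", True, 100.0, 105.0))
print(status.entries["node-a"].report_time)  # 110.0
```

Keeping only the newest report per node means a delayed or reordered broadcast cannot overwrite fresher information, which matters because the later election and failure checks read the report times from this list.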
Step S2: the monitoring and alarm service of each node periodically checks its locally maintained node status list and judges whether its own node is the earliest-started node currently running; if so, it sends a broadcast message notifying all nodes that it will serve as the master node. Specifically:
The monitoring and alarm service of each node periodically checks the locally maintained node status list and, using the running-state and time information in the list, judges whether its own node is the earliest-started node among the nodes currently running normally;
If it is, the node sends a broadcast message notifying all nodes that it will serve as the master node.
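The election rule of step S2 can be expressed as a small pure function. In this sketch, which is an assumption rather than the patented implementation, a node counts as running normally only if its last report says it is healthy and is no older than a timeout, and ties on start time are broken by node ID. The names `NodeState`, `elect_master` and `should_claim_master` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class NodeState:
    # Hypothetical status-list entry; see the step S1 description.
    healthy: bool
    start_time: float
    report_time: float


def elect_master(entries: Dict[str, NodeState], now: float, timeout: float) -> Optional[str]:
    """Return the earliest-started node among nodes currently running normally.

    'Running normally' is assumed to mean: the last report says healthy and is no
    older than `timeout` seconds. Ties on start time go to the smaller node ID.
    """
    candidates = [
        (state.start_time, node_id)
        for node_id, state in entries.items()
        if state.healthy and now - state.report_time <= timeout
    ]
    return min(candidates)[1] if candidates else None


def should_claim_master(self_id: str, entries: Dict[str, NodeState], now: float, timeout: float) -> bool:
    # Every node runs the same check; only the winner broadcasts its claim.
    return elect_master(entries, now, timeout) == self_id


entries = {
    "n1": NodeState(True, 50.0, 80.0),    # started first, but its reports have stopped
    "n2": NodeState(True, 100.0, 128.0),
    "n3": NodeState(False, 80.0, 129.0),  # reports an unhealthy service
    "n4": NodeState(True, 120.0, 129.0),
}
print(elect_master(entries, now=130.0, timeout=10.0))  # n2
```

Because every node applies the same deterministic rule to its own copy of the list, the nodes converge on the same master once their lists agree; the source does not describe how two simultaneous claims would be resolved while the lists still differ.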
Step S3: once a node has been confirmed as the master node, it performs the duties of the master node. The main duty of the master node is to distribute monitoring and alarm tasks appropriately among the slave nodes. When new monitoring and alarm tasks are created, the master node distributes them evenly among all nodes. When a new slave node joins or a node is removed, the master node redistributes all monitoring and alarm tasks among the nodes so that the task load stays balanced.
Step S4: a monitoring and alarm task message queue, containing all current monitoring and alarm tasks, is maintained by means of the message queue protocol. Specifically:
The monitoring and alarm task message queue holds all monitoring and alarm tasks currently being executed. All nodes can access this queue and obtain monitoring and alarm tasks from it;
When a new monitoring and alarm task is created, it is appended to the tail of the message queue;
When a monitoring and alarm task has been executed, it is deleted from the message queue.
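The queue behaviour described in step S4 (append new tasks at the tail, let every node read them, delete a task once it has been executed) can be modelled in memory as below. The `TaskQueue` class is a hypothetical stand-in for the AMQP-backed queue; the source does not say how non-destructive reads by all nodes map onto a real AMQP broker, so no broker API is used here.

```python
from collections import OrderedDict
from typing import Any, List, Tuple


class TaskQueue:
    """In-memory stand-in for the AMQP-backed monitoring and alarm task queue."""

    def __init__(self) -> None:
        self._tasks: "OrderedDict[str, Any]" = OrderedDict()

    def submit(self, task_id: str, spec: Any) -> None:
        # A newly created task is appended to the tail of the queue.
        if task_id in self._tasks:
            raise ValueError(f"duplicate task id: {task_id}")
        self._tasks[task_id] = spec

    def complete(self, task_id: str) -> None:
        # An executed task is deleted from the queue; unknown IDs are ignored.
        self._tasks.pop(task_id, None)

    def snapshot(self) -> List[Tuple[str, Any]]:
        # Any node can read the current tasks, oldest first, without removing them.
        return list(self._tasks.items())


queue = TaskQueue()
queue.submit("cpu-check", {"host": "h1"})
queue.submit("disk-check", {"host": "h2"})
queue.complete("cpu-check")
print(queue.snapshot())  # [('disk-check', {'host': 'h2'})]
```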
Step S5: after the master node fails, it can no longer broadcast the running state of its local monitoring and alarm service. When the other nodes check their locally maintained status lists of all nodes, they select the earliest-started node currently running as the new master node, which obtains all tasks from the monitoring and alarm message queue and redistributes them among all nodes. Specifically:
When the current master node fails for some reason, it can no longer broadcast the running state of its local monitoring and alarm service, so the other nodes stop receiving state updates from it. Once this period exceeds a preset value, the running state of the current master node is judged to be faulty;
Every node checks its locally maintained monitoring and alarm service status list and selects the earliest-started node currently running as the new master node;
The new master node reads the monitoring and alarm task message queue, obtains all current monitoring and alarm tasks, and redistributes them among all nodes.
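Step S5 combines failure detection, re-election and redistribution. A minimal sketch, assuming the failure threshold is a fixed `timeout` in seconds and that the new master re-deals the tasks read from the queue round-robin over the surviving nodes (the source only says the tasks are redistributed appropriately). `NodeState`, `master_failed` and `fail_over` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class NodeState:
    start_time: float   # when the node's monitoring and alarm service started
    last_report: float  # time the last status broadcast from the node arrived


def master_failed(status: Dict[str, NodeState], master_id: str, now: float, timeout: float) -> bool:
    # No update from the master for longer than `timeout` means it is treated as faulty.
    entry = status.get(master_id)
    return entry is None or now - entry.last_report > timeout


def fail_over(
    status: Dict[str, NodeState], master_id: str, now: float, timeout: float, task_ids: List[str]
) -> Tuple[Optional[str], Optional[Dict[str, List[str]]]]:
    """Return (master, assignment); assignment is None when no failover was needed."""
    if not master_failed(status, master_id, now, timeout):
        return master_id, None
    # Surviving nodes, earliest-started first (node ID breaks ties).
    live = sorted(
        (n for n, s in status.items() if n != master_id and now - s.last_report <= timeout),
        key=lambda n: (status[n].start_time, n),
    )
    if not live:
        return None, None
    # Re-deal every task read from the queue round-robin (assumed policy).
    assignment: Dict[str, List[str]] = {n: [] for n in live}
    for i, task in enumerate(task_ids):
        assignment[live[i % len(live)]].append(task)
    return live[0], assignment


status = {
    "m": NodeState(start_time=10.0, last_report=100.0),  # silent for 30 s
    "a": NodeState(start_time=20.0, last_report=128.0),
    "b": NodeState(start_time=15.0, last_report=129.0),
}
print(fail_over(status, "m", now=130.0, timeout=10.0, task_ids=["t1", "t2", "t3"]))
# ('b', {'b': ['t1', 't3'], 'a': ['t2']})
```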
Fig. 3 is an operational flow chart of a preferred embodiment of step S3 of the cloud platform monitoring and alarm method of the present invention, in which, after new monitoring and alarm tasks are created, the master node distributes them to the slave nodes.
Step S311: the master node periodically monitors the monitoring and alarm task message queue.
Step S312: newly created monitoring and alarm tasks are submitted to the monitoring and alarm message queue, and the master node obtains the new task information from the queue.
Step S313: the master node applies a suitable algorithm to distribute the monitoring and alarm tasks to the nodes and sends the relevant task IDs to each slave node, thereby achieving load balancing.
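One polling round of steps S311 to S313 might look like the sketch below. The source only says the master applies a suitable algorithm; least-loaded-first assignment is an assumed stand-in, and `send` is a hypothetical callback in place of the RPC call that delivers task IDs to each slave node.

```python
from typing import Callable, Dict, List


def dispatch_new_tasks(
    queue_task_ids: List[str],
    assigned: Dict[str, str],
    nodes: List[str],
    send: Callable[[str, List[str]], None],
) -> Dict[str, List[str]]:
    """One polling round of the master, as a sketch.

    queue_task_ids: task IDs currently in the task queue, oldest first
    assigned:       task_id -> node_id already handed out (updated in place)
    nodes:          IDs of nodes currently running normally
    send:           stand-in for the RPC call that delivers task IDs to a node
    """
    new_tasks = [t for t in queue_task_ids if t not in assigned]
    if not new_tasks or not nodes:
        return {}
    # Current load per node, counted from existing assignments.
    load = {n: 0 for n in nodes}
    for node in assigned.values():
        if node in load:
            load[node] += 1
    batches: Dict[str, List[str]] = {}
    for task in new_tasks:
        # Least-loaded node first; node ID breaks ties (assumed policy).
        target = min(nodes, key=lambda n: (load[n], n))
        load[target] += 1
        assigned[task] = target
        batches.setdefault(target, []).append(task)
    # Send each node only the IDs of its newly assigned tasks.
    for node, task_ids in batches.items():
        send(node, task_ids)
    return batches


assigned = {"t1": "a"}
dispatch_new_tasks(["t1", "t2", "t3", "t4"], assigned, ["a", "b"],
                   send=lambda node, ids: print("send", node, ids))
# send b ['t2', 't4']
# send a ['t3']
```

Only task IDs are sent to the slave nodes, as in step S313; presumably each slave then reads the task details from the shared queue, which all nodes can access.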
Fig. 4 is an operational flow chart of a preferred embodiment of how the master node handles a newly joined slave node in step S3 of the cloud platform monitoring and alarm method of the present invention.
In step S321, the master node periodically updates its locally maintained status list of all nodes.
In step S322, the master node determines whether a node has newly joined from the running state and report time of each node in the all-node status list.
In step S323, if a node has newly joined, the master node accesses the message queue that holds the monitoring and alarm tasks and obtains the new monitoring and alarm tasks.
In step S324, the master node applies a suitable algorithm to assign tasks to the newly joined node, thereby achieving load balancing.
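Steps S322 to S324 can be sketched as two functions: one that spots nodes reporting recently that the master has not yet accounted for, and one that moves tasks from the busiest nodes onto the newcomer. The move-until-within-one rule and the names `detect_joined` and `rebalance_onto` are assumptions; the source leaves the balancing algorithm unspecified and describes the master reading the tasks from the queue, which this sketch reduces to an in-memory mapping of node IDs to task IDs.

```python
from typing import Dict, Iterable, List


def detect_joined(last_report: Dict[str, float], known: Iterable[str], now: float, timeout: float) -> List[str]:
    """Nodes that reported within `timeout` but are not yet known to the master."""
    known_set = set(known)
    return sorted(n for n, t in last_report.items() if n not in known_set and now - t <= timeout)


def rebalance_onto(assignment: Dict[str, List[str]], newcomer: str) -> List[str]:
    """Move tasks from the busiest nodes to `newcomer`; return the moved task IDs.

    Stops once the newcomer holds at most one task fewer than the busiest node.
    """
    assignment.setdefault(newcomer, [])
    moved: List[str] = []
    while True:
        # Busiest node; node ID breaks ties (assumed policy).
        donor = max(assignment, key=lambda n: (len(assignment[n]), n))
        if len(assignment[donor]) - len(assignment[newcomer]) <= 1:
            return moved
        task = assignment[donor].pop()
        assignment[newcomer].append(task)
        moved.append(task)


assignment = {"a": ["t1", "t2", "t3"], "b": ["t4", "t5", "t6"]}
for node in detect_joined({"a": 100.0, "b": 100.0, "c": 99.0}, assignment, now=101.0, timeout=5.0):
    rebalance_onto(assignment, node)
print(assignment)  # {'a': ['t1', 't2'], 'b': ['t4', 't5'], 'c': ['t6', 't3']}
```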
Fig. 5 is an operational flow chart of a preferred embodiment of how the master node handles a removed node in step S3 of the cloud platform monitoring and alarm method of the present invention.
In step S331, the master node periodically updates its locally maintained status list of all nodes.
In step S332, the master node determines whether a node has been removed from the running state and report time of each node in the all-node status list.
In step S333, if a node has been removed, the master node accesses the message queue that holds the monitoring and alarm tasks and obtains the tasks for which that node was responsible.
In step S334, the master node applies a suitable algorithm to assign these tasks to all nodes that are running normally, thereby achieving load balancing.
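Steps S332 to S334 mirror the join case: detect known nodes that have stopped reporting, then hand their tasks to the least-loaded survivors. As before, the timeout rule, the tie-breaking and the names `detect_removed` and `reassign_orphans` are illustrative assumptions, not details taken from the source.

```python
from typing import Dict, Iterable, List


def detect_removed(last_report: Dict[str, float], known: Iterable[str], now: float, timeout: float) -> List[str]:
    """Known nodes with no status report for longer than `timeout` seconds."""
    return sorted(n for n in known if n not in last_report or now - last_report[n] > timeout)


def reassign_orphans(assignment: Dict[str, List[str]], removed: List[str]) -> List[str]:
    """Give the removed nodes' tasks to the surviving nodes; return the reassigned IDs."""
    orphaned: List[str] = []
    for node in removed:
        orphaned.extend(assignment.pop(node, []))
    if orphaned and not assignment:
        raise RuntimeError("no surviving node can take over the orphaned tasks")
    for task in orphaned:
        # Least-loaded survivor first; node ID breaks ties (assumed policy).
        target = min(assignment, key=lambda n: (len(assignment[n]), n))
        assignment[target].append(task)
    return orphaned


assignment = {"a": ["t1"], "b": ["t2", "t3"], "c": ["t4", "t5"]}
removed = detect_removed({"a": 100.0, "b": 80.0, "c": 100.0}, assignment, now=101.0, timeout=5.0)
reassign_orphans(assignment, removed)
print(removed, assignment)  # ['b'] {'a': ['t1', 't2', 't3'], 'c': ['t4', 't5']}
```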
Although the present invention has been described with reference to presently preferred embodiments, those skilled in the art will understand that these embodiments serve only to illustrate the invention and are not intended to limit its scope of protection. Any modification, equivalent substitution, improvement and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (10)

1. A cloud platform monitoring and alarm method, characterized in that the method comprises the following steps:
a. for each node: broadcasting the running state of the node, receiving the running states broadcast by the other nodes, and periodically updating the locally maintained monitoring and alarm service status list of all nodes;
b. determining a master node according to the monitoring and alarm service status list;
c. the node determined to be the master node performing the duties of the master node;
d. maintaining a monitoring and alarm task message queue by means of a message queue protocol.
2. The method according to claim 1, characterized in that the method further comprises step e:
when the master node fails, determining a new master node and redistributing the tasks.
3. The method according to claim 1, characterized in that the monitoring and alarm service status list contains the running-state information and time information of all nodes.
4. The method according to claim 3, characterized in that step b specifically comprises:
judging, according to the running-state information and time information in the monitoring and alarm service status list, whether the present node is the earliest-started node among the nodes currently running normally;
if the present node is the earliest-started node among the nodes currently running normally, sending a broadcast message notifying all nodes that the present node will serve as the master node.
5. The method according to claim 1, characterized in that step c specifically comprises:
when new monitoring and alarm tasks are created, the master node distributing the new monitoring and alarm tasks evenly among all nodes;
when a new slave node joins or a node is removed, the master node redistributing all monitoring and alarm tasks among the nodes.
6. The method according to claim 5, characterized in that the master node distributing the new monitoring and alarm tasks evenly among all nodes when new monitoring and alarm tasks are created specifically comprises:
submitting the newly created monitoring and alarm tasks to the monitoring and alarm message queue, the master node obtaining the new task information from the message queue;
the master node applying a suitable algorithm to distribute the monitoring and alarm tasks to the nodes and sending the relevant task IDs to each slave node, thereby achieving load balancing.
7. The method according to claim 5, characterized in that the master node redistributing all monitoring and alarm tasks among the nodes when a new slave node joins specifically comprises:
the master node determining, according to the running state and report time of each node in the all-node status list, whether a node has newly joined;
if a node has newly joined, the master node accessing the message queue that holds the monitoring and alarm tasks and obtaining new monitoring and alarm tasks;
the master node applying a suitable algorithm to assign tasks to the newly joined node, thereby achieving load balancing.
8. The method according to claim 5, characterized in that the master node redistributing all monitoring and alarm tasks among the nodes when a node is removed specifically comprises:
the master node determining, according to the running state and report time of each node in the all-node status list, whether a node has been removed;
if a node has been removed, the master node accessing the message queue that holds the monitoring and alarm tasks and obtaining the monitoring and alarm tasks for which that node was responsible;
the master node applying a suitable algorithm to assign these tasks to all nodes that are running normally, thereby achieving load balancing.
9. The method according to claim 1, characterized in that step d specifically comprises:
all nodes being able to access the monitoring and alarm task message queue and obtain monitoring and alarm tasks from it;
appending newly created monitoring and alarm tasks to the tail of the monitoring and alarm task message queue;
deleting executed monitoring and alarm tasks from the monitoring and alarm task message queue.
10. The method according to claim 2, characterized in that step e specifically comprises:
if the period during which the other nodes receive no state update information from the current master node exceeds a preset value, judging that the running state of the current master node is faulty;
all nodes checking their locally maintained monitoring and alarm service status lists and selecting the earliest-started node currently running as the new master node;
the new master node reading the monitoring and alarm task message queue, obtaining all current monitoring and alarm tasks, and redistributing them among all nodes.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410841470.0A (CN105812159B) 2014-12-30 2014-12-30 A cloud platform monitoring and alarm method

Publications (2)

Publication Number Publication Date
CN105812159A 2016-07-27
CN105812159B 2019-06-04

Family

ID=56980157

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410841470.0A (Active, CN105812159B) 2014-12-30 2014-12-30 A cloud platform monitoring and alarm method

Country Status (1)

Country Link
CN (1) CN105812159B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1512729A (en) * 2002-12-31 2004-07-14 联想(北京)有限公司 Method for network equipment self adaption load equalization
CN101924650A (en) * 2010-08-04 2010-12-22 浙江省电力公司 Method for implementing services and intelligent server autonomy of failure information system
CN102882909A (en) * 2011-07-15 2013-01-16 易云捷讯科技(北京)有限公司 Cloud computing service monitoring system and method thereof
US20140032766A1 (en) * 2002-05-10 2014-01-30 Silicon Graphics International Corp. Real-time storage area network

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107395458A (en) * 2017-07-31 2017-11-24 东软集团股份有限公司 system monitoring method and device
CN107395458B (en) * 2017-07-31 2020-05-22 东软集团股份有限公司 System monitoring method and device
CN107608285A (en) * 2017-09-01 2018-01-19 北京南凯自动化系统工程有限公司 A kind of comprehensive monitoring system
CN107608285B (en) * 2017-09-01 2019-10-08 北京南凯自动化系统工程有限公司 A kind of comprehensive monitoring system
CN109144737A (en) * 2018-10-09 2019-01-04 郑州云海信息技术有限公司 Controller management method, apparatus and storage medium in a kind of distributed cluster system
CN112685199A (en) * 2020-12-30 2021-04-20 平安普惠企业管理有限公司 Message queue repairing method and device, computer equipment and storage medium
CN112685199B (en) * 2020-12-30 2023-10-20 董小君 Message queue repairing method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant