CN105812159A - Cloud platform monitoring and alarm method

Info

Publication number: CN105812159A
Application number: CN201410841470.0A
Authority: CN
Inventors: 刘冬; 喻之斌; 贝振东; 须成忠
Original assignee: Shenzhen Institute of Advanced Technology of CAS
Current assignee: Shenzhen Institute of Advanced Technology of CAS
Priority date: 2014-12-30
Filing date: 2014-12-30
Publication date: 2016-07-27
Anticipated expiration: 2034-12-30
Also published as: CN105812159B

Abstract

The invention relates to a cloud platform monitoring and alarm method comprising the following steps: each node broadcasts its own running status, receives the running statuses broadcast by the other nodes, and periodically updates its locally maintained running-status list of the monitoring alarm services of all nodes; a master node is determined according to the monitoring alarm service running-status list; the node determined to be the master node performs the duties of the master node; and a monitoring alarm task message queue is maintained through a message queue protocol. The scheme is simple and reliable to implement and offers good scalability, high availability, and fault tolerance.

Description

A cloud platform monitoring and alarm method

Technical field

The present invention relates to a cloud platform monitoring and alarm method.

Background art

The maturing of virtualization technology and its combination with grid computing gave rise to the cloud computing platform. A cloud computing platform is a resource pool of huge, mutually shared and cooperating infrastructure, data storage, platforms, and software. Layered services are abstracted on top of this pool and offered to users on a pay-per-use basis, such as infrastructure (IaaS), platform (PaaS), and software (SaaS) services.

Monitoring is an important component of a cloud computing platform. It is the prerequisite for many operations on the platform, such as network analysis, system administration, job scheduling, load balancing, event prediction, and fault detection and recovery. It helps the platform quantify resource usage dynamically, detect service deficiencies, discover user usage patterns, and support the decisions of the resource scheduling module, and it plays a significant role in improving the platform's quality of service. A cloud computing platform includes not only the underlying storage, network, and computing resources, but also the virtual resources built on them and the cloud platform obtained by abstracting and integrating these resources. A cloud platform is filled with large numbers of heterogeneous, dynamic, and complex resources in a distributed environment, and monitoring and managing them efficiently and dynamically is what guarantees high-quality service.

The well-known cloud computing platforms in industry each have their own monitoring solution for monitoring and alarming on the platform. However, these solutions usually adopt a single-node model, in which one node is responsible for monitoring the whole cloud computing platform. As the cloud platform keeps expanding, the monitoring load grows, and scalability and fault tolerance are poor, so it is hard to guarantee that the platform's monitoring alarm tasks are executed efficiently. A monitoring and alarm system for a cloud platform should therefore not only carry out monitoring tasks efficiently but also offer good scalability, high availability, and fault tolerance.

Summary of the invention

In view of this, it is necessary to provide a cloud platform monitoring and alarm method.

The present invention provides a cloud platform monitoring and alarm method, characterized in that the method comprises the following steps: a. for each node: broadcasting the running status of this node, receiving the running statuses broadcast by the other nodes, and periodically updating the locally maintained running-status list of the monitoring alarm services of all nodes; b. determining a master node according to the monitoring alarm service running-status list; c. the node determined to be the master node performing the duties of the master node; d. maintaining a monitoring alarm task message queue through a message queue protocol.

The method further includes step e: after the master node fails, redetermining the master node and redistributing the tasks.

The monitoring alarm service running-status list includes the running status information and time information of all nodes.

Step b specifically includes: according to the running status information and time information in the monitoring alarm service running-status list, judging whether this node is the earliest-started node among the nodes whose current running status is normal; if it is, sending a broadcast message notifying all nodes that this node serves as the master node.

Step c specifically includes: after new monitoring alarm tasks are created, the master node distributes the new monitoring alarm tasks evenly among all nodes; when a new slave node joins or a node is deleted, the master node redistributes all monitoring alarm tasks among the nodes.

Distributing the new monitoring alarm tasks evenly among all nodes after they are created specifically includes: after new monitoring alarm tasks are created, they are submitted to the monitoring alarm message queue, and the master node obtains the new task information from the message queue; the master node applies a suitable allocation algorithm to distribute the monitoring alarm tasks among the nodes and sends the relevant task IDs to each slave node, achieving load balancing.

Redistributing all monitoring alarm tasks among the nodes when a new slave node joins specifically includes: the master node determines whether there is a newly added node according to the running status and report time of each node in the running-status list of all nodes; if there is a newly added node, the master node accesses the message queue that maintains the monitoring alarm tasks and obtains new monitoring alarm tasks; the master node applies a suitable allocation algorithm to assign tasks to the newly added node, achieving load balancing.

Redistributing all monitoring alarm tasks among the nodes when a node is deleted specifically includes: the master node determines whether a node has been deleted according to the running status and report time of each node in the running-status list of all nodes; if a node has been deleted, the master node accesses the message queue that maintains the monitoring alarm tasks and obtains the monitoring alarm tasks that this node was responsible for; the master node applies a suitable allocation algorithm to assign these tasks to all normally running nodes, achieving load balancing.

Step d specifically includes: all nodes can access the monitoring alarm task message queue and obtain monitoring alarm tasks from it; a newly created monitoring alarm task is inserted at the tail of the monitoring alarm task message queue; a monitoring alarm task that has been executed is deleted from the monitoring alarm task message queue.

Step e specifically includes: if the period during which the other nodes fail to receive status update information from the current master node exceeds a set value, the running status of the current master node is judged to be faulty; all nodes check their locally maintained monitoring alarm service running-status lists and select the currently earliest-started node as the new master node; the new master node reads the monitoring alarm task message queue, obtains all current monitoring alarm tasks, and redistributes them appropriately among all nodes for execution.

The cloud platform monitoring and alarm method of the present invention adopts a distributed architecture. The scheme is simple and reliable to implement, can efficiently complete the large volume of monitoring alarm tasks of a cloud platform, and offers good scalability, high availability, and good fault tolerance.

Brief description of the drawings

Fig. 1 is a schematic diagram of the running environment of the cloud platform monitoring and alarm method of the present invention;

Fig. 2 is a flow chart of the cloud platform monitoring and alarm method of the present invention;

Fig. 3 is an operation flow chart of a preferred embodiment of step S3 of the present invention when new monitoring alarm tasks are created;

Fig. 4 is an operation flow chart of a preferred embodiment of step S3 of the present invention when a new slave node joins;

Fig. 5 is an operation flow chart of a preferred embodiment of step S3 of the present invention when a node is deleted.

Detailed description of the embodiments

The present invention is described in further detail below with reference to the drawings and specific embodiments.

Fig. 1 is a schematic diagram of the running environment of the cloud platform monitoring and alarm method of the present invention.

The running environment of the cloud platform monitoring and alarm method adopts a master-slave architecture comprising a master node and several slave nodes. A monitoring alarm service process runs on each node, and together the nodes complete the monitoring alarm tasks of the cloud platform. The master node and the slave nodes, and the slave nodes among themselves, communicate through a remote procedure call protocol (Remote Procedure Call Protocol, RPC) and a message queue protocol (Advanced Message Queuing Protocol, AMQP).

Fig. 2 is an operation flow chart of a preferred embodiment of the cloud platform monitoring and alarm method of the present invention.

Step S1: the monitoring alarm service of each node periodically broadcasts the running status of this node through the remote procedure call protocol, informing the other nodes whether this node is running normally. At the same time, each node receives the running statuses broadcast by the other nodes and periodically updates its locally maintained running-status list of the monitoring alarm services of all nodes. Specifically:

The monitoring alarm service of each node periodically broadcasts the running status of this node to all nodes through the remote procedure call protocol. The broadcast running status includes the current time information, so that the other nodes know whether this node is running normally at that moment.

The monitoring alarm service of each node receives, through the remote procedure call protocol, the running statuses broadcast by the other nodes and periodically updates the locally maintained running-status list of the monitoring alarm services of all nodes. The list records the current running status information and time information of all nodes.
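The bookkeeping in step S1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `NodeStatus`, `StatusList`, and `on_broadcast` are hypothetical, the RPC transport is left out, and carrying the service start time in every broadcast is an assumption made so that the list holds the time information later used to pick the earliest-started node.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class NodeStatus:
    """One entry of the locally maintained running-status list (hypothetical layout)."""
    node_id: str
    healthy: bool        # running status reported by the node
    started_at: float    # service start time (assumed to be part of each broadcast)
    reported_at: float   # time information attached to the broadcast


class StatusList:
    """Local view of every node's monitoring alarm service, keyed by node ID."""

    def __init__(self) -> None:
        self.entries: dict[str, NodeStatus] = {}

    def on_broadcast(self, status: NodeStatus) -> None:
        # Keep only the newest report per node; out-of-order (stale) reports are ignored.
        current = self.entries.get(status.node_id)
        if current is None or status.reported_at >= current.reported_at:
            self.entries[status.node_id] = status


def demo_status_list() -> StatusList:
    """Feed a few broadcasts into a fresh list, including one stale report."""
    view = StatusList()
    view.on_broadcast(NodeStatus("node-a", True, started_at=100.0, reported_at=105.0))
    view.on_broadcast(NodeStatus("node-b", True, started_at=102.0, reported_at=106.0))
    view.on_broadcast(NodeStatus("node-a", False, started_at=100.0, reported_at=110.0))
    view.on_broadcast(NodeStatus("node-a", True, started_at=100.0, reported_at=104.0))
    return view
```

Each node would run the same update logic on every broadcast it receives, so all nodes converge on the same list as long as broadcasts are delivered.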

Step S2: the monitoring alarm service of each node periodically checks the node running-status list it maintains and judges whether this node is the currently earliest-started node. If it is, the node sends a broadcast message notifying all nodes that it serves as the master node. Specifically:

The monitoring alarm service of each node periodically checks the locally maintained node running-status list and, according to the running status information and time information in the list, judges whether this node is the earliest-started node among the nodes whose current running status is normal;

If this node is the earliest-started node among the nodes whose current running status is normal, it sends a broadcast message notifying all nodes that it serves as the master node.
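The selection rule of step S2 can be sketched in a few lines. The names `NodeStatus`, `elect_master`, and `should_announce` are hypothetical, and breaking ties on the node ID is an added assumption (the text does not say what happens when two nodes report the same start time); it is there only so that every node derives the same answer from the same list.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NodeStatus:
    node_id: str
    healthy: bool      # whether the node's running status is normal
    started_at: float  # start time of the node's monitoring alarm service


def elect_master(status_list: list[NodeStatus]) -> Optional[str]:
    """Return the ID of the earliest-started healthy node, or None if none is healthy."""
    healthy = [s for s in status_list if s.healthy]
    if not healthy:
        return None
    # Assumed tie-break on node_id keeps the choice deterministic across nodes.
    return min(healthy, key=lambda s: (s.started_at, s.node_id)).node_id


def should_announce(self_id: str, status_list: list[NodeStatus]) -> bool:
    """True when this node is the one that should broadcast that it is the master."""
    return elect_master(status_list) == self_id
```

Because every node evaluates the same rule against its own copy of the list, no separate voting round is needed; only the node for which `should_announce` is true sends the broadcast.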

Step S3: once a node has been determined to be the master node, it performs the duties of the master node. The main task of the master node is to distribute the monitoring alarm tasks appropriately among the slave nodes. After new monitoring alarm tasks are created, the master node distributes them evenly among all nodes. When a new slave node joins or a node is deleted, the master node redistributes all monitoring alarm tasks among the nodes so that the task load is balanced.

Step S4: a monitoring alarm task message queue is maintained through the message queue protocol, and this message queue contains all current monitoring alarm tasks. Specifically:

The monitoring alarm task message queue holds all monitoring alarm tasks currently being executed. All nodes can access this message queue and obtain monitoring alarm tasks from it;

When a new monitoring alarm task is created, it is inserted at the tail of the message queue;

When a monitoring alarm task has been executed, it is deleted from the message queue.
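The three queue operations of step S4 can be illustrated with an in-memory stand-in. This is only a sketch of the behaviour described above: `TaskQueue`, `submit`, `complete`, and `pending` are hypothetical names, and a real deployment would sit on an AMQP broker rather than a Python dictionary.

```python
from __future__ import annotations


class TaskQueue:
    """In-memory stand-in for the AMQP-backed task queue (not a real broker client)."""

    def __init__(self) -> None:
        # A dict preserves insertion order, so the last key is the queue tail.
        self._tasks: dict[str, dict] = {}

    def submit(self, task_id: str, spec: dict) -> None:
        """Insert a newly created task at the tail of the queue."""
        if task_id in self._tasks:
            raise ValueError(f"duplicate task id: {task_id}")
        self._tasks[task_id] = spec

    def complete(self, task_id: str) -> None:
        """Delete a task once it has been executed; unknown IDs are ignored."""
        self._tasks.pop(task_id, None)

    def pending(self) -> list[str]:
        """Task IDs in queue order, head first; readable by every node."""
        return list(self._tasks)
```

Keeping the queue as the single record of current tasks is what lets any node, including a newly elected master, recover the full task set by reading it.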

Step S5: after the master node fails, it can no longer broadcast the running status of its local monitoring alarm service. When the other nodes check their locally maintained running-status lists of all nodes, they select the currently earliest-started node as the master node, which obtains all tasks from the monitoring alarm message queue and redistributes them among all nodes. Specifically:

When the current master node fails for some reason, it can no longer broadcast the running status of its local monitoring alarm service, so the other nodes cannot receive status update information from the current master node. Once this period exceeds a set value, the running status of the current master node is judged to be faulty;

All nodes check their locally maintained monitoring alarm service running-status lists and select the currently earliest-started node as the new master node;

The new master node reads the monitoring alarm task message queue, obtains all current monitoring alarm tasks, and redistributes them appropriately among all nodes for execution.
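Step S5 combines timeout detection, re-election, and redistribution. The sketch below strings the three together under stated assumptions: the function name `fail_over` and its parameters are hypothetical, a node counts as running normally if its last report is within the same timeout, and round-robin stands in for the unspecified redistribution rule.

```python
from __future__ import annotations

from typing import Optional


def fail_over(master_id: str,
              started_at: dict[str, float],
              last_report: dict[str, float],
              tasks: list[str],
              now: float,
              timeout: float) -> Optional[tuple[str, dict[str, list[str]]]]:
    """Return (new_master, assignment) if the master has been silent too long, else None."""
    if now - last_report[master_id] <= timeout:
        return None  # the master is still reporting in time
    # Nodes that reported within the window are treated as running normally (assumption).
    alive = sorted(n for n, t in last_report.items()
                   if n != master_id and now - t <= timeout)
    if not alive:
        return None  # nobody left to take over
    # Same rule as the initial election: the earliest-started node wins.
    new_master = min(alive, key=lambda n: (started_at[n], n))
    # Round-robin is a stand-in for the unspecified allocation algorithm.
    assignment: dict[str, list[str]] = {n: [] for n in alive}
    for i, task_id in enumerate(tasks):
        assignment[alive[i % len(alive)]].append(task_id)
    return new_master, assignment
```

In this sketch `tasks` would be the full list read from the monitoring alarm task message queue, so the tasks previously handled by the failed master are covered by the new assignment as well.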

Fig. 3 is an operation flow chart of a preferred embodiment of step S3 of the cloud platform monitoring and alarm method of the present invention, in which the master node distributes new monitoring alarm tasks to the slave nodes after the tasks are created.

Step S311: the master node periodically monitors the message queue of monitoring alarm tasks.

Step S312: after new monitoring alarm tasks are created, they are submitted to the monitoring alarm message queue, and the master node obtains the new task information from the message queue.

Step S313: the master node applies a suitable allocation algorithm to distribute the monitoring alarm tasks among the nodes and sends the relevant task IDs to each slave node, achieving load balancing.
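Step S313 leaves the allocation algorithm open. One plausible choice, shown here purely as an assumption, is to hand each new task to the node that currently carries the fewest tasks. The function name `assign_new_tasks` and the `load` mapping (node ID to current task count) are hypothetical.

```python
from __future__ import annotations

import heapq


def assign_new_tasks(load: dict[str, int], new_task_ids: list[str]) -> dict[str, list[str]]:
    """Give each new task to the currently least-loaded node; returns task IDs per node."""
    if not load:
        raise ValueError("at least one node is required")
    # Min-heap of (task count, node ID); the node ID breaks ties deterministically.
    heap = [(count, node) for node, count in load.items()]
    heapq.heapify(heap)
    plan: dict[str, list[str]] = {node: [] for node in load}
    for task_id in new_task_ids:
        count, node = heapq.heappop(heap)
        plan[node].append(task_id)
        heapq.heappush(heap, (count + 1, node))
    return plan
```

The returned plan maps each node to the task IDs the master would send to it, which matches the step's description of sending the relevant task IDs to each slave node.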

Fig. 4 is an operation flow chart of a preferred embodiment of the processing performed by the master node in step S3 of the cloud platform monitoring and alarm method of the present invention when a new slave node joins.

In step S321, the master node periodically maintains and updates the local running-status list of all nodes.

In step S322, the master node determines whether there is a newly added node according to the running status and report time of each node in the running-status list of all nodes.

In step S323, if there is a newly added node, the master node accesses the message queue that maintains the monitoring alarm tasks and obtains new monitoring alarm tasks.

In step S324, the master node applies a suitable allocation algorithm to assign tasks to the newly added node, achieving load balancing.
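Steps S322 and S324 can be sketched as a detection function plus a rebalancing function. Both are illustrative assumptions: `find_new_nodes` treats any healthy node without an assignment entry as newly added, and `add_node` moves tasks from the most loaded nodes to the newcomer until loads differ by at most one, which is only one possible reading of the unspecified allocation algorithm.

```python
from __future__ import annotations


def find_new_nodes(status_list: dict[str, bool],
                   assignment: dict[str, list[str]]) -> list[str]:
    """Healthy nodes present in the status list that have no assignment entry yet."""
    return sorted(n for n, healthy in status_list.items()
                  if healthy and n not in assignment)


def add_node(assignment: dict[str, list[str]], new_node: str) -> dict[str, list[str]]:
    """Return a new assignment in which tasks are moved to new_node until loads are even."""
    result = {n: list(ts) for n, ts in assignment.items()}
    result[new_node] = []
    while True:
        # The most loaded node donates one task per round.
        donor = max(result, key=lambda n: (len(result[n]), n))
        if len(result[donor]) - len(result[new_node]) <= 1:
            return result
        result[new_node].append(result[donor].pop())
```

Moving only the surplus tasks keeps most existing assignments in place, so slave nodes that are already balanced are not disturbed when a node joins.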

Fig. 5 is an operation flow chart of a preferred embodiment of the processing performed by the master node in step S3 of the cloud platform monitoring and alarm method of the present invention when a node is deleted.

In step S331, the master node periodically maintains and updates the local running-status list of all nodes.

In step S332, the master node determines whether a node has been deleted according to the running status and report time of each node in the running-status list of all nodes.

In step S333, if a node has been deleted, the master node accesses the message queue that maintains the monitoring alarm tasks and obtains the monitoring alarm tasks that this node was responsible for.

In step S334, the master node applies a suitable allocation algorithm to assign these tasks to all normally running nodes, achieving load balancing.
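Steps S332 to S334 can be sketched in the same style. The names `find_removed_nodes` and `remove_node` are hypothetical, treating a node as deleted once its report is older than a timeout is an assumption about how the report time is used, and handing each orphaned task to the least-loaded surviving node is an assumed allocation rule.

```python
from __future__ import annotations


def find_removed_nodes(last_report: dict[str, float],
                       assignment: dict[str, list[str]],
                       now: float,
                       timeout: float) -> list[str]:
    """Assigned nodes whose latest report is missing or older than the timeout."""
    return sorted(n for n in assignment
                  if now - last_report.get(n, float("-inf")) > timeout)


def remove_node(assignment: dict[str, list[str]], removed: str) -> dict[str, list[str]]:
    """Return a new assignment with the removed node's tasks spread over the other nodes."""
    result = {n: list(ts) for n, ts in assignment.items() if n != removed}
    if not result:
        raise ValueError("no running node left to take over the tasks")
    for task_id in assignment.get(removed, []):
        # Each orphaned task goes to the node that currently has the fewest tasks.
        target = min(result, key=lambda n: (len(result[n]), n))
        result[target].append(task_id)
    return result
```

Only the tasks of the deleted node are reassigned here; the tasks already running on the remaining nodes stay where they are.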

Although the present invention has been described with reference to the currently preferred embodiments, those skilled in the art will appreciate that these embodiments serve only to illustrate the invention and do not limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A cloud platform monitoring and alarm method, characterized in that the method comprises the following steps:

a. for each node: broadcasting the running status of this node, receiving the running statuses broadcast by the other nodes, and periodically updating the locally maintained running-status list of the monitoring alarm services of all nodes;

b. determining a master node according to the monitoring alarm service running-status list;

c. the node determined to be the master node performing the duties of the master node;

d. maintaining a monitoring alarm task message queue through a message queue protocol.

2. the method for claim 1, it is characterised in that the method also includes step e:

When master node redefines master node after breaking down and distribute task.

3. the method for claim 1, it is characterised in that described monitoring alarm service operation status list includes: the running state information of all nodes and temporal information.

4. The method of claim 3, characterized in that step b specifically includes:

according to the running status information and time information in the monitoring alarm service running-status list, judging whether this node is the earliest-started node among the nodes whose current running status is normal;

5. the method for claim 1, it is characterised in that described step c specifically includes:

After having some new monitoring alarm tasks to be created, monitoring alarm task new described in master node mean allocation gives all nodes；

New when adding from node or when there being node to be deleted when having, master node redistributes all of monitoring alarm task to each node.

6. The method of claim 5, characterized in that the master node distributing the new monitoring alarm tasks evenly among all nodes after they are created specifically includes:

after new monitoring alarm tasks are created, submitting them to the monitoring alarm message queue, the master node obtaining the new task information from the message queue;

the master node applying a suitable allocation algorithm to distribute the monitoring alarm tasks among the nodes and sending the relevant task IDs to each slave node, achieving load balancing.

7. The method of claim 5, characterized in that the master node redistributing all monitoring alarm tasks among the nodes when a new slave node joins specifically includes:

the master node determining whether there is a newly added node according to the running status and report time of each node in the running-status list of all nodes;

if there is a newly added node, the master node accessing the message queue that maintains the monitoring alarm tasks and obtaining new monitoring alarm tasks;

the master node applying a suitable allocation algorithm to assign tasks to the newly added node, achieving load balancing.

8. The method of claim 5, characterized in that the master node redistributing all monitoring alarm tasks among the nodes when a node is deleted specifically includes:

the master node determining whether a node has been deleted according to the running status and report time of each node in the running-status list of all nodes;

if a node has been deleted, the master node accessing the message queue that maintains the monitoring alarm tasks and obtaining the monitoring alarm tasks that this node was responsible for;

the master node applying a suitable allocation algorithm to assign these tasks to all normally running nodes, achieving load balancing.

9. the method for claim 1, it is characterised in that described step d specifically includes:

All nodes all may have access to described monitoring alarm task message queue, therefrom obtains monitoring alarm task；

Newly created monitoring alarm task is inserted into the afterbody of described monitoring alarm task message queue；

Executed monitoring alarm task is deleted from described monitoring alarm task message queue.

10. The method of claim 2, characterized in that step e specifically includes:

if the period during which the other nodes fail to receive status update information from the current master node exceeds a set value, judging that the running status of the current master node is faulty;