CN114064414A - High-availability cluster state monitoring method and system - Google Patents

High-availability cluster state monitoring method and system

Info

Publication number
CN114064414A
CN114064414A (application CN202111413336.7A)
Authority
CN
China
Prior art keywords
monitoring
cluster
node
master node
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111413336.7A
Other languages
Chinese (zh)
Inventor
高鸣飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SmartX Inc
Original Assignee
SmartX Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SmartX Inc filed Critical SmartX Inc
Priority to CN202111413336.7A
Publication of CN114064414A
Pending legal-status Critical Current

Classifications

    • G06F 11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G06F 11/1441 Resetting or repowering (saving, restoring, recovering or retrying at system level)
    • G06F 11/2023 Failover techniques
    • G06F 11/203 Failover techniques using migration
    • G06F 11/3055 Monitoring arrangements for monitoring the status of the computing system or of the computing system component, e.g. monitoring if the computing system is on, off, available, not available
    • G06F 11/3065 Monitoring arrangements determined by the means or processing involved in reporting the monitored data
    • G06F 11/3093 Configuration details of the sensing arrangements (interfaces, connectors, sensors, probes, agents), e.g. installation, enabling, spatial arrangement of the probes
    • G06F 16/119 Details of migration of file systems
    • G06F 16/1824 Distributed file systems implemented using Network-attached Storage [NAS] architecture
    • G06F 16/183 Provision of network file services by network file servers, e.g. by using NFS, CIFS
    • G06F 16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Hardware Redundancy (AREA)

Abstract

The embodiment of the invention discloses a high-availability cluster state monitoring method and system. The method comprises: assigning roles to the server nodes in the cluster by means of a voting mechanism, the roles comprising a master node that runs the monitoring service and slave nodes that forward monitoring data query requests; the master node mounting a data path from a network file system, starting the monitoring service after the mount succeeds, and writing monitoring sample data to the data path for storage; and determining whether the running state of the cluster has changed and, if so, starting life-cycle management of the monitoring service, so that the cluster monitoring system remains highly available by migrating the master node and the monitoring service. The invention effectively avoids loss of monitoring data when a single point of failure occurs in the cluster, achieves rapid migration of data and of the monitoring service, keeps the system highly available in various states, occupies few resources and has broad application prospects.

Description

High-availability cluster state monitoring method and system
Technical Field
The invention relates to the technical field of cluster state monitoring, in particular to a high-availability cluster state monitoring method and system.
Background
A cluster state monitoring system is a technical framework that collects monitoring metrics from every node in a cluster (performance metrics, resource utilization metrics, exception information and the like), triggers alarms according to configured alarm rules, and provides a data query service. In a typical implementation, a monitoring service instance is started on a "fat node" in the cluster (usually a manually designated node with ample compute and storage resources); this instance collects the monitoring metrics of every node in the cluster onto its local machine, serves external data queries from local storage, and generates alarm information.
As shown in FIG. 1, a system architecture implementing such a monitoring system works as follows: all servers form a cluster, and each server in the cluster may be called a node. An exporter is started on each node to expose monitoring metric data that can be scraped; the node running the monitoring service periodically pulls the performance metrics exposed by every node in the cluster (including itself) and stores them on its local disk; the monitoring service answers external data queries by reading the local files and sends alarm information according to the configured alarm rules.
The existing cluster state monitoring system is built on a single node with local storage, and the monitoring service is concerned only with how to store and query the collected data efficiently. Because the service runs on a single node, it faces the following problems:
1) The monitoring service and its data live on only one node of the cluster, so the monitoring system is exposed to a single point of failure and has no failover scheme. If the monitoring service node (the node where the monitoring service runs) becomes unavailable, the cluster monitoring data is lost and the health status of the cluster can no longer be observed.
2) The monitoring service occupies a large amount of memory and collects a large volume of monitoring sample data, so deploying the monitoring service on several nodes at once would cause data redundancy and excessive memory usage, leaving the cluster short of storage and memory resources.
Therefore, in view of the above drawbacks of the prior art, it is desirable to design a highly available monitoring method and system that overcomes these technical disadvantages.
Disclosure of Invention
In view of this, embodiments of the present invention provide a high-availability cluster state monitoring method and system.
An embodiment of the present invention provides a high-availability cluster state monitoring method, comprising:
assigning roles to the server nodes in the cluster by means of a voting mechanism, the roles comprising a master node that runs a monitoring service and slave nodes that forward monitoring data query requests;
the master node mounting a data path from a network file system, starting the monitoring service after the mount succeeds, and writing monitoring sample data to the data path for storage;
and determining whether the running state of the cluster has changed and, if so, starting life-cycle management of the monitoring service, so that the cluster monitoring system remains highly available by migrating the master node and the monitoring service.
Exemplarily, the "the host node mounts a data path from a network file system, starts a monitoring service after the mounting is successful, and writes monitoring sample data into the data path for storage" includes:
the master node holds a global lock and checks whether the monitoring service of the master node is running;
if the data path is not operated, the main node mounts the data path from a network file system, wherein the network file system is NFS;
and starting monitoring service, writing monitoring sample data into the data path for storage, and periodically sending monitoring instance information and heartbeat information to the distributed database.
Exemplarily, there is one and only one global lock, and "the master node holding a global lock" comprises:
the master node writing its own IP information into an authentication service of the distributed database, so that the master node is qualified to send heartbeat information to the distributed database.
Exemplarily, the operation states of the cluster include a normal operation state, a cluster restart state, and a master node server failure state.
Exemplarily, the "determining whether the running state of the cluster changes, and if so, starting the life cycle management of the monitoring service" includes:
judging whether the running state of the cluster is restarted or the master node server fails;
if the master node server fails, a new master node is selected again from the cluster by using the voting mechanism to replace the original master node with the failure, the new master node starts monitoring service, and monitoring instance information and heartbeat information in a distributed database are updated;
if the cluster is restarted, the voting mechanism is utilized to redistribute the master and slave roles for the server nodes in the cluster, and the redistributed master nodes finish the mounting of the data storage path and the operation of the monitoring service.
Exemplarily, the "if a master node server fails, reselecting a new master node in the cluster to replace the failed original master node by using the voting mechanism, starting the monitoring service by the new master node, and updating the monitoring instance information and the heartbeat information in the distributed database" includes:
the distributed database judges whether the heartbeat information of the original main node is updated;
if the monitoring instance information of the original main node is not updated any more, the distributed database deletes the monitoring instance information of the original main node;
reselecting a new main node in the cluster by using the voting mechanism;
the new master node holds a global lock and judges whether monitoring instance information exists in the distributed database or not;
if the data storage path does not exist, the new main node mounts the data storage path in the form of a network file system, starts monitoring service on the new main node, and periodically updates monitoring instance information and heartbeat information in the distributed database.
Exemplarily, the new master node periodically determines whether the monitoring instance information in the distributed database belongs to itself; if not, it stops the monitoring service it is running and unmounts the data path, releases the global lock, and waits for the next new master node to take over.
Exemplarily, a slave node receives a monitoring data query request and forwards it to the master node, and the master node accesses the data on the data path of the network file system according to the query request and returns the data.
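By way of illustration only, this forwarding behaviour can be sketched in Python using only the standard library; the listening port, the query path layout and the get_master_address() helper are assumptions made for the sketch and are not part of the disclosure.

    # Sketch: a slave node forwarding monitoring data queries to the master node.
    # The port, URL layout and get_master_address() are hypothetical.
    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.request import urlopen

    def get_master_address() -> str:
        # Hypothetical: in practice this would come from the election result
        # or from the instance information in the distributed database.
        return "http://192.168.0.10:9090"

    class ForwardingHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Forward the monitoring data query to the master node unchanged.
            upstream = get_master_address() + self.path
            with urlopen(upstream, timeout=10) as resp:
                body = resp.read()
                self.send_response(resp.status)
                self.send_header("Content-Length", str(len(body)))
                self.end_headers()
                self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 9090), ForwardingHandler).serve_forever()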
Exemplarily, the voting mechanism is the internal election mechanism of Zookeeper.
Yet another embodiment of the present invention provides a high-availability cluster state monitoring system, comprising:
a role assignment unit, configured to assign roles to the server nodes in the cluster by means of a voting mechanism, the roles comprising a master node that runs a monitoring service and slave nodes that forward monitoring data query requests;
a data storage unit, configured to have the master node mount a data path from a network file system, start the monitoring service after the mount succeeds, and write monitoring sample data to the data path for storage;
and a monitoring service life-cycle management unit, configured to determine whether the running state of the cluster has changed and, if so, to start life-cycle management of the monitoring service, so that the cluster monitoring system remains highly available by migrating the master node and the monitoring service.
Another embodiment of the present invention provides a terminal, comprising a processor and a memory, the memory storing a computer program, the processor executing the computer program to implement the high-availability cluster state monitoring method described above.
Yet another embodiment of the present invention provides a computer-readable storage medium storing a computer program which, when executed, implements the high-availability cluster state monitoring method described above.
In the method provided by the embodiment of the invention, roles are first assigned to the nodes in the cluster through the cluster's master-slave election mechanism; the monitoring service then runs on the master node according to the assigned roles, and the slave nodes forward all received requests to the master node, so that monitoring data can be queried while resources are allocated reasonably. A network file system is used to store the monitoring data so that the data can be migrated rapidly when a failure occurs. At the same time, when the cluster state changes, life-cycle management of the monitoring service keeps the monitoring system highly available in states such as cluster failure and restart. The invention effectively avoids loss of monitoring data when a single point of failure occurs in the cluster, achieves rapid migration of data and of the monitoring service, keeps the system highly available in various states, occupies few resources and has broad application prospects.
Drawings
In order to more clearly illustrate the technical solution of the present invention, the drawings required to be used in the embodiments will be briefly described below, and it should be understood that the following drawings only illustrate some embodiments of the present invention, and therefore should not be considered as limiting the scope of the present invention. Like components are numbered similarly in the various figures.
FIG. 1 illustrates a system architecture diagram of a monitoring system in the prior art;
FIG. 2 is a flow chart of a highly available cluster state monitoring method according to an embodiment of the present invention;
FIG. 3 is a flowchart of the method of step S102 according to an embodiment of the present invention;
FIG. 4 is a flowchart of the method of step S103 according to an embodiment of the present invention;
FIG. 5 illustrates a flow diagram of a monitoring service lifecycle management method of an embodiment of the present invention;
fig. 6 is a schematic diagram illustrating a highly available cluster state monitoring system according to an embodiment of the present invention.
Description of the main element symbols:
10-a role assignment unit; 20-a data storage unit; 30-monitoring service lifecycle management unit.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments.
As shown in FIG. 1, a cluster state monitoring system in the prior art generally starts a monitoring service instance on a fat node in the cluster; this instance collects the monitoring metrics of every node in the cluster onto its local machine, serves external data queries from local storage, and generates alarm information. To overcome the prior art's vulnerability to single points of failure and its large memory footprint, the embodiments of the present invention provide a high-availability cluster state monitoring method and system that effectively avoid loss of monitoring data when a single point of failure occurs in the cluster, achieve rapid migration of data and of the monitoring service, keep the system highly available in various states, occupy few resources, and have broad application prospects.
Example 1
Referring to FIG. 2, a high-availability cluster state monitoring method includes:
Step S101: assign roles to the server nodes in the cluster by means of a voting mechanism, the roles comprising a master node that runs the monitoring service and slave nodes that forward monitoring data query requests. Here, the voting mechanism may be the internal election mechanism of Zookeeper. A slave node receives a monitoring data query request and forwards it to the master node, and the master node accesses the data on the data path of the network file system according to the query request and returns the data.
Step S102: the master node mounts the data path from the network file system, starts the monitoring service after the mount succeeds, and writes monitoring sample data to the data path for storage.
Step S103: determine whether the running state of the cluster has changed. If it has changed, step S1031 is executed: start life-cycle management of the monitoring service, and keep the cluster monitoring system highly available by migrating the master node and the monitoring service. If it has not changed, step S1032 is executed: keep the current master and slave node states. Here, the running state of the cluster includes a normal running state, a cluster restart state, and a master node server failure state. When the master node is migrated, the monitoring service application is migrated along with it.
Specifically, the embodiment of the present invention uses the master-slave election mechanism of the cluster system: the Zookeeper interface of the open-source project is called, and the election mechanism inside Zookeeper assigns roles (master node and slave node) to each node in the cluster. The monitoring service runs on the master node and migrates whenever the master node changes, and the slave nodes forward received monitoring data query requests to the master node, so that monitoring data can be queried while resources are allocated reasonably.
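As a minimal illustration of this election step, the sketch below uses the kazoo client for Zookeeper; the Zookeeper address, the election path and the callback body are assumptions made only for the sketch.

    # Sketch: electing the master node through Zookeeper, assuming the kazoo library.
    # The hosts string, election path and run_as_master() body are hypothetical.
    import socket
    from kazoo.client import KazooClient

    def run_as_master():
        # Only the elected master reaches this point; here it would mount the
        # NFS data path and start the monitoring service (see the later sketches).
        print("elected master:", socket.gethostname())

    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()

    # Every node joins the same election; nodes that are not elected block inside
    # run() and act as slaves (forwarding queries) until their turn comes.
    election = zk.Election("/ha-monitor/election", identifier=socket.gethostname())
    election.run(run_as_master)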
Compared with the existing scheme, the monitoring data storage approach provided in step S102 of the embodiment of the present invention uses a high-performance Network File System (NFS) to migrate data rapidly when a failure occurs. With the network file system, the data can be mounted on the corresponding master node on demand: before trying to pull up the monitoring service, the master node tries to mount the data path from NFS, and it starts the monitoring service only after the mount succeeds. The monitoring service writes the pulled monitoring sample data to this path; when the master node fails or changes, the data path is automatically unmounted from NFS and remounted on the new master node as the new master node's monitoring service starts, so the historical monitoring data is protected when a single point of failure occurs in the cluster.
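The mount-before-start order can be sketched as follows; the NFS export, the local mount point and the monitoring service command are assumptions, since the description does not name a concrete monitoring service.

    # Sketch: mount the shared data path from NFS first, then start the monitoring
    # service pointing at it. The export, mount point and service command are
    # hypothetical; only the ordering (mount, then start) follows the description.
    import subprocess

    NFS_EXPORT = "10.0.0.100:/monitor-data"     # hypothetical NFS export
    MOUNT_POINT = "/var/lib/monitor-data"       # hypothetical local data path

    def mount_data_path() -> None:
        subprocess.run(["mkdir", "-p", MOUNT_POINT], check=True)
        subprocess.run(["mount", "-t", "nfs", NFS_EXPORT, MOUNT_POINT], check=True)

    def start_monitoring_service() -> subprocess.Popen:
        # Placeholder command: any collector that writes its samples under
        # MOUNT_POINT would play this role.
        return subprocess.Popen(["monitoring-service", "--storage-path", MOUNT_POINT])

    if __name__ == "__main__":
        mount_data_path()                # the service is started only after a successful mount
        start_monitoring_service().wait()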
The master node is responsible for managing the life cycle of the monitoring service: it starts the monitoring service on its own node and periodically checks the health of the monitoring service. When the monitoring service is in an abnormal state, the master node keeps trying to pull it up again until it returns to a normal state. While the monitoring service is running, the master node also periodically records its running state and stores heartbeat information in the distributed database.
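One way to realize this duty is a simple watchdog loop, sketched below under the assumption of an HTTP health endpoint; the interval, the probe URL, the restart command and the heartbeat helper are illustrative only.

    # Sketch: the master probes the monitoring service periodically, re-pulls it
    # when the probe fails, and records a heartbeat when it is healthy.
    # Interval, health URL, restart command and write_heartbeat() are hypothetical.
    import subprocess
    import time
    from urllib.error import URLError
    from urllib.request import urlopen

    CHECK_INTERVAL = 10  # seconds; mirrors the 10-second loop described later

    def service_healthy() -> bool:
        try:
            with urlopen("http://127.0.0.1:9090/health", timeout=3) as resp:
                return resp.status == 200
        except (URLError, OSError):
            return False

    def restart_service() -> None:
        subprocess.Popen(["monitoring-service", "--storage-path", "/var/lib/monitor-data"])

    def write_heartbeat() -> None:
        # Hypothetical: refresh the heartbeat record in the distributed database.
        pass

    while True:
        if service_healthy():
            write_heartbeat()
        else:
            restart_service()   # keep retrying until the service returns to normal
        time.sleep(CHECK_INTERVAL)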
The steps will be described in more detail below.
Specifically, as shown in FIG. 3, step S102 includes:
Step S201: the master node holds a global lock and checks whether its own monitoring service is running. Here, the master node writes its own IP information into an authentication service of the distributed database, which qualifies the master node to send heartbeat information to the distributed database. The global lock can only be released by its holder; nodes that do not hold it can neither release nor preempt it.
Step S202: if the monitoring service is not running, the master node mounts the data path from a network file system, the network file system being NFS.
Step S203: start the monitoring service, write monitoring sample data to the data path for storage, and periodically send monitoring instance information and heartbeat information to the distributed database.
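Because the description does not name a concrete distributed database, the global-lock behaviour of step S201 is sketched below against an abstract key-value store in a compare-and-set style; the store, the key names and the lease length are assumptions.

    # Sketch: holding the global lock by writing this node's IP as running_host
    # and refreshing heartbeat_time. KVStore is a stand-in for the unspecified
    # distributed database; the key names and the 2-minute lease follow the text,
    # but the interface itself is an assumption.
    import time
    from dataclasses import dataclass, field

    @dataclass
    class KVStore:
        data: dict = field(default_factory=dict)

        def compare_and_set(self, key, expected, new) -> bool:
            if self.data.get(key) == expected:
                self.data[key] = new
                return True
            return False

    def try_hold_global_lock(db: KVStore, my_ip: str) -> bool:
        # The lock is free when running_host is empty; only then may it be claimed.
        return db.compare_and_set("running_host", None, my_ip) or \
               db.data.get("running_host") == my_ip

    def send_heartbeat(db: KVStore, my_ip: str) -> None:
        # Only the lock holder refreshes the heartbeat; a heartbeat left stale
        # for more than about 2 minutes lets the record be reclaimed.
        if db.data.get("running_host") == my_ip:
            db.data["heartbeat_time"] = time.time()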
Specifically, as shown in FIG. 4, step S103 includes:
Step S301: determine whether the cluster has been restarted or the master node server has failed.
Step S3021: if the master node server has failed, use the voting mechanism to reselect a new master node in the cluster to replace the failed original master node; the new master node starts the monitoring service and updates the monitoring instance information and heartbeat information in the distributed database. Here, the heartbeat information reflects which host the monitoring service instance runs on and whether it is healthy (if the instance is not running, the heartbeat information stops being updated).
Step S3022: if the cluster has been restarted, use the voting mechanism to reassign master and slave roles to the server nodes in the cluster; the reassigned master node completes the mounting of the data storage path and the running of the monitoring service. Here, the distributed database mainly determines whether the heartbeat information of the original master node is still being updated. If the heartbeat is no longer updated, the distributed database deletes the monitoring instance information of the original master node; the voting mechanism then reselects a new master node in the cluster, and the new master node holds the global lock and checks whether monitoring instance information exists in the distributed database. Because the distributed database has deleted the monitoring instance information of the original master node, this information normally does not exist; at this point the new master node mounts the data storage path in the form of a network file system, starts the monitoring service on itself, and periodically updates the monitoring instance information and heartbeat information in the distributed database.
The embodiment of the present invention further explains these steps by taking the case where the master node in the cluster fails. After the master node fails, the master-slave election mechanism ensures, through the Zookeeper interface, that a new master node is elected in the cluster, and at this point the monitoring service should fail over. Because the original master node has failed, the holder of the monitoring service instance information in the distributed database (the original master node) no longer updates the heartbeat information, and the instance information is automatically deleted by the distributed database after 2 minutes, which prevents the deadlock that the downtime of the original master node would otherwise cause. The new master node reads the information in the database; when it detects that no monitoring service instance information exists in the distributed database, meaning that no monitoring service is running in the current cluster, it starts the monitoring service on itself and updates the instance information and heartbeat information of the monitoring service in the distributed database, so that the monitoring service migrates when the master node migrates.
Considering that other kinds of events may also cause the master node to change, for example manually switching the master node, if the master node finds that the instance information in the distributed database does not belong to itself, it immediately stops its monitoring service and waits for the new master node to take over and update the information. After the takeover, the monitoring data is remounted on the new master node through the Network File System (NFS), so the monitoring data can be migrated rapidly.
A specific implementation of step S103 is shown in FIG. 5.
First, every node, master and slave alike, executes the life-cycle monitoring loop periodically (at an interval of 10 seconds). In the loop, each node first tries to fetch the monitoring service state instance from the distributed database; if it does not exist, the node tries to insert monitoring instance information into the distributed database. This information mainly consists of four parts: (1) the host information of the node currently running the monitoring service instance, i.e. the node holding the global lock, which is set to null by default at creation; (2) the time the global lock was created; (3) the heartbeat time of the global lock, which the database deletes when it has not been updated for more than 2 minutes; (4) the time the monitoring instance was last started.
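The four fields listed above can be captured in a small record type; the sketch below uses a Python dataclass and reuses the running_host name that appears later in this description, while the other field names, types and defaults are assumptions.

    # Sketch of the monitoring-instance record kept in the distributed database.
    # Field names mirror the four parts listed above; types and defaults are assumptions.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class MonitorInstance:
        running_host: Optional[str] = None  # node holding the global lock; null on creation
        lock_created_at: float = 0.0        # time the global lock was created
        heartbeat_time: float = 0.0         # deleted by the database if stale for > 2 minutes
        last_started_at: float = 0.0        # time the monitoring instance was last started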
Then, once the monitoring service state instance exists in the distributed database, the distributed database periodically checks whether the heartbeat information has gone without update for more than 2 minutes; if so, the database deletes the monitoring service instance. While the monitoring service state instance exists in the distributed database, each node checks, by calling the Zookeeper interface, whether it is currently the master node. If it is the master node, it tries to hold the global lock (i.e. to write running_host as its own IP information); if the holder is not this node, it exits the loop and waits for the next execution. If the holder is this node, the master node checks whether its own monitoring service is running; if not, it tries to mount the monitoring data path on NFS, starts the monitoring service instance, and finally updates the heartbeat information.
If the current node is not the master node, it checks whether it is running the monitoring service; if so, it stops the monitoring service instance and unmounts the NFS data path. It then checks whether the instance holder is itself; if so, the master node has changed and the current node is the old master node, so it must release the global lock. The lock is released by setting running_host to null, so that the new master node can hold the global lock; the lock is held by writing running_host as the node's own IP information. In this way, only the node whose IP matches running_host can modify the heartbeat information in the database, and no other node can. The new master node periodically checks whether the monitoring instance information in the distributed database belongs to itself; if not, it stops the monitoring service it is running and unmounts the data path, and waits for the global lock to be released (either released by the original master node, or released by the distributed database when the heartbeat has not been updated for more than 2 minutes). It should be noted that the heartbeat information in the distributed database effectively serves as the global lock, and only the node holding the global lock may run the monitoring service.
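Putting the loop just described together, a condensed sketch of one pass follows; the db and svc handles, the is_current_master() check and the helper names are hypothetical stand-ins that only mirror the flow of FIG. 5.

    # Condensed sketch of one pass of the life-cycle loop, run every ~10 s on
    # every node. db, svc and is_current_master() are hypothetical stand-ins for
    # the distributed database, the local service manager and the Zookeeper check.
    import time

    HEARTBEAT_TTL = 120  # seconds; instance info expires after 2 minutes without heartbeat

    def lifecycle_tick(db, svc, my_ip, is_current_master):
        inst = db.get_instance()
        if inst is None:
            # No instance registered yet: create one with an empty holder (running_host null).
            db.insert_instance(running_host=None, lock_created_at=time.time())
            return

        # Normally the database expires the record itself; shown here for clarity.
        if time.time() - inst["heartbeat_time"] > HEARTBEAT_TTL:
            db.delete_instance()
            return

        if is_current_master():
            # Try to hold the global lock by writing our own IP as running_host.
            if inst["running_host"] in (None, my_ip):
                db.update_instance(running_host=my_ip)
            else:
                return  # another node still holds the lock; wait for the next pass
            if not svc.monitoring_running():
                svc.mount_nfs_data_path()
                svc.start_monitoring()
                db.update_instance(last_started_at=time.time())
            db.update_instance(heartbeat_time=time.time())
        else:
            # Not the master: stop anything running locally, and release the lock
            # if this node is the stale holder (i.e. it used to be the master).
            if svc.monitoring_running():
                svc.stop_monitoring()
                svc.unmount_nfs_data_path()
            if inst["running_host"] == my_ip:
                db.update_instance(running_host=None)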
With this method, when a failover occurs in the cluster, each node can safely and quickly migrate the monitoring service to the new master node through the life-cycle monitoring mechanism of the monitoring service, and the monitoring data path is dynamically mounted through NFS, achieving high availability and fast failover for the monitoring system.
In the method, roles are first assigned to the nodes in the cluster through the cluster's master-slave election mechanism; the monitoring service then runs on the master node according to the assigned roles, and the slave nodes forward all received requests to the master node. The master node stores the instance information and heartbeat information of the monitoring service in the distributed database, keeps them synchronized periodically, and provides the external monitoring data query service. Before the monitoring service is started, the data storage path is mounted from the network file system so that the data can be migrated quickly. Finally, when the master node in the cluster fails, the cluster elects a new master node, updates the instance information and heartbeat information of the monitoring service in the database, starts the monitoring service on the new master node, and mounts the monitoring data from the network file system. In this way the cluster monitoring system remains highly available, and its availability and quality are improved while avoiding running multiple monitoring service instances and avoiding data redundancy.
Example 2
As shown in FIG. 6, a high-availability cluster state monitoring system includes:
a role assignment unit 10, configured to assign roles to the server nodes in the cluster by means of a voting mechanism, the roles comprising a master node that runs the monitoring service and slave nodes that forward monitoring data query requests;
a data storage unit 20, configured to have the master node mount a data path from a network file system, start the monitoring service after the mount succeeds, and write monitoring sample data to the data path for storage;
and a monitoring service life-cycle management unit 30, configured to determine whether the running state of the cluster has changed and, if so, to start life-cycle management of the monitoring service, so that the cluster monitoring system remains highly available by migrating the master node and the monitoring service.
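As one possible object layout for these three units, a sketch follows; the class and method names are illustrative assumptions and do not appear in the disclosure.

    # Sketch: the three units of FIG. 6 as plain Python classes.
    # Class and method names are illustrative assumptions.
    class RoleAssignmentUnit:
        """Assigns master/slave roles to the cluster nodes via the voting mechanism (unit 10)."""
        def assign_roles(self, nodes):
            ...

    class DataStorageUnit:
        """Mounts the NFS data path on the master node and stores monitoring samples there (unit 20)."""
        def mount_and_start(self, master_node):
            ...

    class MonitoringLifecycleUnit:
        """Watches the cluster state and migrates the master node and monitoring service on change (unit 30)."""
        def handle_state_change(self, cluster_state):
            ...

    class HighAvailabilityMonitoringSystem:
        def __init__(self):
            self.role_assignment = RoleAssignmentUnit()
            self.data_storage = DataStorageUnit()
            self.lifecycle = MonitoringLifecycleUnit()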
It is to be understood that the above-described highly available cluster state monitoring system corresponds to the highly available cluster state monitoring method of embodiment 1. Any of the options in embodiment 1 are also applicable to this embodiment, and will not be described in detail here.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention.

Claims (10)

1. A high-availability cluster state monitoring method, characterized by comprising:
assigning roles to server nodes in the cluster by means of a voting mechanism, wherein the roles comprise a master node for running a monitoring service and slave nodes for forwarding monitoring data query requests;
the master node mounting a data path from a network file system, starting the monitoring service after the mount succeeds, and writing monitoring sample data to the data path for storage;
and determining whether the running state of the cluster has changed and, if so, starting life-cycle management of the monitoring service, so that the cluster monitoring system remains highly available by migrating the master node and the monitoring service.
2. The high-availability cluster state monitoring method according to claim 1, wherein the step of the master node mounting a data path from a network file system, starting the monitoring service after the mount succeeds, and writing monitoring sample data to the data path for storage comprises:
the master node holding a global lock and checking whether its own monitoring service is running;
if it is not running, the master node mounting the data path from a network file system, wherein the network file system is NFS;
and starting the monitoring service, writing monitoring sample data to the data path for storage, and periodically sending monitoring instance information and heartbeat information to a distributed database.
3. The method of claim 2, wherein there is one and only one global lock, and the step of the master node holding a global lock comprises:
the master node writing its own IP information into an authentication service of the distributed database, so that the master node is qualified to send heartbeat information to the distributed database.
4. The method according to claim 1, wherein the running state of the cluster includes a normal running state, a cluster restart state, and a master node server failure state.
5. The high-availability cluster state monitoring method according to claim 3, wherein the step of determining whether the running state of the cluster has changed and, if so, starting life-cycle management of the monitoring service comprises:
determining whether the cluster has been restarted or the master node server has failed;
if the master node server has failed, reselecting a new master node in the cluster by means of the voting mechanism to replace the failed original master node, the new master node starting the monitoring service and updating the monitoring instance information and heartbeat information in the distributed database;
and if the cluster has been restarted, reassigning master and slave roles to the server nodes in the cluster by means of the voting mechanism, the reassigned master node completing the mounting of the data storage path and the running of the monitoring service.
6. The high-availability cluster state monitoring method according to claim 4, wherein the step of, if the master node server has failed, reselecting a new master node in the cluster by means of the voting mechanism to replace the failed original master node, the new master node starting the monitoring service and updating the monitoring instance information and heartbeat information in the distributed database, comprises:
the distributed database determining whether the heartbeat information of the original master node is still being updated;
if it is no longer updated, the distributed database deleting the monitoring instance information of the original master node;
reselecting a new master node in the cluster by means of the voting mechanism;
the new master node holding the global lock and determining whether monitoring instance information exists in the distributed database;
and, if it does not exist, the new master node mounting the data storage path in the form of a network file system, starting the monitoring service on the new master node, and periodically updating the monitoring instance information and heartbeat information in the distributed database.
7. The high-availability cluster state monitoring method according to claim 5, wherein the new master node periodically determines whether the monitoring instance information in the distributed database belongs to itself; if not, it stops the monitoring service it is running and unmounts the data path, and waits for the release of the global lock.
8. The high-availability cluster state monitoring method according to claim 1, wherein a slave node receives a monitoring data query request and forwards it to the master node, and the master node accesses the data on the data path of the network file system according to the query request and returns the data.
9. The high-availability cluster state monitoring method according to claim 1, characterized in that the voting mechanism is the internal election mechanism of Zookeeper.
10. A high-availability cluster state monitoring system, characterized by comprising:
a role assignment unit, configured to assign roles to server nodes in the cluster by means of a voting mechanism, wherein the roles comprise a master node for running a monitoring service and slave nodes for forwarding monitoring data query requests;
a data storage unit, configured to have the master node mount a data path from a network file system, start the monitoring service after the mount succeeds, and write monitoring sample data to the data path for storage;
and a monitoring service life-cycle management unit, configured to determine whether the running state of the cluster has changed and, if so, to start life-cycle management of the monitoring service, so that the cluster monitoring system remains highly available by migrating the master node and the monitoring service.
CN202111413336.7A 2021-11-25 2021-11-25 High-availability cluster state monitoring method and system Pending CN114064414A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111413336.7A CN114064414A (en) 2021-11-25 2021-11-25 High-availability cluster state monitoring method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111413336.7A CN114064414A (en) 2021-11-25 2021-11-25 High-availability cluster state monitoring method and system

Publications (1)

Publication Number Publication Date
CN114064414A true CN114064414A (en) 2022-02-18

Family

ID=80276171

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111413336.7A Pending CN114064414A (en) 2021-11-25 2021-11-25 High-availability cluster state monitoring method and system

Country Status (1)

Country Link
CN (1) CN114064414A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160323098A1 (en) * 2015-04-28 2016-11-03 United States Government As Represented By The Secretary Of The Navy System and Method for High-Assurance Data Storage and Processing based on Homomorphic Encryption
CN104933132A (en) * 2015-06-12 2015-09-23 广州巨杉软件开发有限公司 Distributed database weighted voting method based on operating sequence number
CN105141456A (en) * 2015-08-25 2015-12-09 山东超越数控电子有限公司 Method for monitoring high-availability cluster resource
CN106850759A (en) * 2016-12-31 2017-06-13 广州勤加缘科技实业有限公司 MySQL database clustering methods and its processing system
CN108073460A (en) * 2017-12-29 2018-05-25 北京奇虎科技有限公司 Global lock method for pre-emptively, device and computing device in distributed system
CN110971662A (en) * 2019-10-22 2020-04-07 烽火通信科技股份有限公司 Two-node high-availability implementation method and device based on Ceph
CN110912977A (en) * 2019-11-15 2020-03-24 北京浪潮数据技术有限公司 Configuration file updating method, device, equipment and storage medium
CN112000635A (en) * 2020-08-20 2020-11-27 苏州浪潮智能科技有限公司 Data request method, device and medium
CN112527901A (en) * 2020-12-10 2021-03-19 杭州比智科技有限公司 Data storage system, method, computing device and computer storage medium

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115002122A (en) * 2022-05-09 2022-09-02 中盈优创资讯科技有限公司 Cluster management method and device for data acquisition
CN115102886A (en) * 2022-06-21 2022-09-23 上海驻云信息科技有限公司 Task scheduling method and device for multiple acquisition clients
CN115794769A (en) * 2022-10-09 2023-03-14 云和恩墨(北京)信息技术有限公司 Method for managing high-availability database, electronic device and storage medium
CN115794769B (en) * 2022-10-09 2024-03-19 云和恩墨(北京)信息技术有限公司 Method for managing high-availability database, electronic equipment and storage medium
CN115766405A (en) * 2023-01-09 2023-03-07 苏州浪潮智能科技有限公司 Fault processing method, device, equipment and storage medium
CN117407125A (en) * 2023-12-14 2024-01-16 中电云计算技术有限公司 Pod high availability implementation method, device, equipment and readable storage medium
CN117407125B (en) * 2023-12-14 2024-04-16 中电云计算技术有限公司 Pod high availability implementation method, device, equipment and readable storage medium

Similar Documents

Publication Publication Date Title
US11853263B2 (en) Geographically-distributed file system using coordinated namespace replication over a wide area network
CN114064414A (en) High-availability cluster state monitoring method and system
US10122595B2 (en) System and method for supporting service level quorum in a data grid cluster
CN110377395B (en) Pod migration method in Kubernetes cluster
EP3069495B1 (en) Client-configurable security options for data streams
EP3069228B1 (en) Partition-based data stream processing framework
CN110362390B (en) Distributed data integration job scheduling method and device
CN105814544B (en) System and method for supporting persistent partition recovery in a distributed data grid
US20150278244A1 (en) Geographically-distributed file system using coordinated namespace replication over a wide area network
US10630566B1 (en) Tightly-coupled external cluster monitoring
JP6123626B2 (en) Process resumption method, process resumption program, and information processing system
CN105493474A (en) System and method for supporting partition level journaling for synchronizing data in a distributed data grid
CN110377664B (en) Data synchronization method, device, server and storage medium
CN115292408A (en) Master-slave synchronization method, device, equipment and medium for MySQL database
CN105830029B (en) system and method for supporting adaptive busy-wait in a computing environment
US10324811B2 (en) Opportunistic failover in a high availability cluster
US11449241B2 (en) Customizable lock management for distributed resources
CN111897626A (en) Cloud computing scene-oriented virtual machine high-reliability system and implementation method
US20150169236A1 (en) System and method for supporting memory allocation control with push-back in a distributed data grid
CN115470303A (en) Database access method, device, system, equipment and readable storage medium
US20240028611A1 (en) Granular Replica Healing for Distributed Databases
JP4485560B2 (en) Computer system and system management program
CN113342511A (en) Distributed task management system and method
JPH09319720A (en) Distributed process managing system
CN115964353B (en) Distributed file system and access metering method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20220218)