CN112994935B - prometheus management and control method, device, equipment and storage medium - Google Patents

prometheus management and control method, device, equipment and storage medium

Info

Publication number
CN112994935B
Authority
CN
China
Prior art keywords
instance
prometheus
service
main
monitoring
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110171945.XA
Other languages
Chinese (zh)
Other versions
CN112994935A (en)
Inventor
刘田龙
杨乐
马兵兵
邓沛沛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fiberhome Telecommunication Technologies Co Ltd
Original Assignee
Fiberhome Telecommunication Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fiberhome Telecommunication Technologies Co Ltd filed Critical Fiberhome Telecommunication Technologies Co Ltd
Priority to CN202110171945.XA
Publication of CN112994935A
Application granted
Publication of CN112994935B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0663 Performing the actions predefined by failover planning, e.g. switching to standby network elements
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0668 Management of faults, events, alarms or notifications using network fault recovery by dynamic selection of recovery network elements, e.g. replacement by the most appropriate element after failure
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/08 Configuration management of networks or network elements
    • H04L41/0893 Assignment of logical groups to network elements
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/51 Discovery or management thereof, e.g. service location protocol [SLP] or web services

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a prometheus management and control method, device, equipment, and storage medium. The method comprises: acquiring each watcher instance from a distributed observer watcher system, and deploying each watcher instance as a sidecar in each prometheus service; monitoring each instance in the prometheus service by using a gossip protocol, voting for failover when a failure or service abnormality of the prometheus main instance is detected, and determining a new main instance as the target main instance; and, after the failover is completed, collecting each monitoring index of the monitored service through the target main instance and writing each monitoring index into a time-series database for persistence. The method achieves high availability and data consistency for the prometheus cluster as well as data persistence, and it is flexible, efficient, and readily extensible.

Description

prometheus management and control method, device, equipment and storage medium
Technical Field
The invention relates to the field of cloud-native container technology, in particular to a prometheus management and control method, device, equipment and storage medium.
Background
At present, on container platforms orchestrated by the container orchestration engine Kubernetes, the system monitoring and alarming framework prometheus is deployed as a single point; if the prometheus service becomes abnormal or goes offline, monitoring data is incomplete or lost.
For the single-point problem, the following solutions are common at present: 1. Basic High Availability (HA), i.e. service availability: the user simply deploys multiple sets of prometheus service instances that scrape the same Exporter targets. The basic HA mode can only ensure the availability of the prometheus service; it solves neither the data-consistency problem nor the persistence problem between prometheus services, i.e. data cannot be recovered after being lost, and dynamic expansion is impossible. This deployment is therefore suitable only when the monitoring scale is small, the prometheus service does not migrate frequently, and only short-period monitoring data needs to be stored.
2. A hierarchical deployment in which a root prometheus instance aggregates the data of the other instances. This solves the single-point and data-inconsistency problems, but if the root instance itself has a problem, the monitoring system cannot operate normally, and data persistence still cannot be guaranteed.
Disclosure of Invention
The invention mainly aims to provide a prometheus management and control method, device, equipment, and storage medium, so as to solve the technical problems in the prior art that data among prometheus services is inconsistent and data persistence cannot be guaranteed.
In a first aspect, the invention provides a prometheus management and control method, which comprises the following steps:
acquiring each watcher instance from a distributed observer watcher system, and deploying each watcher instance as a sidecar in each monitoring and alarm framework prometheus service;
monitoring each instance in the prometheus service by using a gossip protocol, voting for failover when a failure or service abnormality of the prometheus main instance is detected, and determining a new main instance as the target main instance;
and, after the failover is completed, collecting each monitoring index of the monitored service through the target main instance, and writing each monitoring index into a time-series database for persistence.
Optionally, the acquiring each watcher instance from the distributed observer watcher system and deploying each watcher instance as a sidecar in each monitoring and alarm framework prometheus service includes:
acquiring each prometheus instance of each monitoring and alarm framework prometheus service, and deploying each prometheus instance in a container orchestration engine Kubernetes cluster;
acquiring each watcher instance from the distributed observer watcher system, and deploying each watcher instance as a sidecar in the data structure pod of each prometheus instance.
Optionally, the monitoring each instance in the prometheus service by using a gossip protocol, voting for failover when a failure or service abnormality of the prometheus main instance is detected, and determining a new main instance as the target main instance includes:
acquiring master instance and slave instance information from the prometheus service by using the gossip protocol, and judging, according to the master instance and slave instance information, whether the master instance has failed or its service is abnormal;
generating a master instance status exception command when a failure or service abnormality of the prometheus main instance is detected;
and sending the master instance status exception command to the other watchers, receiving voting information, performing failover according to the voting information, and determining a new main instance as the target main instance.
Optionally, the acquiring master instance and slave instance information from the prometheus service by using the gossip protocol and judging, according to the master instance and slave instance information, whether the master instance has failed or its service is abnormal includes:
reading the configuration file of prometheus from the prometheus service by using the gossip protocol, acquiring the instance identifiers of all instances from the configuration file, and selecting, from the instance identifiers, the instance whose identifier carries the smallest number as the main instance;
writing the main instance information of the main instance into the monitoring configuration file of the target watcher instance, the monitoring configuration file being loaded when the target watcher instance starts;
after detecting that the target watcher instance has started normally, connecting the other watcher instances and the slave instances other than the master instance in the prometheus service;
periodically acquiring, by the target watcher instance, master instance and slave instance information from the cluster corresponding to the prometheus service;
and determining the current state of the main instance according to the master instance and slave instance information, and judging, according to the current state, whether the main instance has failed or its service is abnormal.
Optionally, the generating a master instance status exception command when a failure or service abnormality of the prometheus main instance is detected includes:
when a failure or service abnormality of the prometheus main instance is detected, taking the current watcher instance corresponding to the main instance as the initiator;
and generating a master instance status exception command asking the other watcher instances whether they agree to the current watcher instance acting as the initiator.
Optionally, the sending the master instance status exception command to the other watcher instances, receiving voting information, performing failover according to the voting information, and determining a new main instance as the target main instance includes:
sending the master instance status exception command to the other watchers, and receiving the voting information fed back by the other watchers;
when the voting information shows that the agreement ratio is greater than a preset ratio, excluding the abnormal prometheus instance, with the current watcher instance acting as the initiator, to obtain the remaining prometheus instance cluster;
and selecting, from the remaining prometheus instance cluster, the instance with the smallest instance identifier as the new main instance, or selecting the instance with the largest replication offset as the new main instance, and taking the new main instance as the target main instance.
Optionally, the collecting, through the target main instance after the failover is completed, each monitoring index of the monitored service and writing each monitoring index into a time-series database for persistence includes:
after the failover is completed, reading the configuration information of the monitored service from a configuration server through the target main instance;
collecting each monitoring index of the monitored service according to the configuration information, and replicating and synchronizing each monitoring index to the other prometheus instances;
and writing each monitoring index into the time-series database influxdb for persistence.
In a second aspect, to achieve the above object, the invention further provides a prometheus management and control device, including:
a deployment module, used for acquiring each watcher instance from the distributed observer watcher system and deploying each watcher instance as a sidecar in each monitoring and alarm framework prometheus service;
an anomaly monitoring module, used for monitoring each instance in the prometheus service by using a gossip protocol, voting for failover when a failure or service abnormality of the prometheus main instance is detected, and determining a new main instance as the target main instance;
and a persistence module, used for collecting, through the target main instance after the failover is completed, each monitoring index of the monitored service, and writing each monitoring index into the time-series database for persistence.
In a third aspect, to achieve the above object, the invention further provides a prometheus management and control apparatus, including: a memory, a processor, and a prometheus management and control program stored on the memory and executable on the processor, the prometheus management and control program being configured to implement the steps of the prometheus management and control method described above.
In a fourth aspect, to achieve the above object, the invention further provides a storage medium on which a prometheus management and control program is stored, the prometheus management and control program, when executed by a processor, implementing the steps of the prometheus management and control method described above.
According to the prometheus management and control method of the invention, each watcher instance is acquired from a distributed observer watcher system and deployed as a sidecar in each monitoring and alarm framework prometheus service; each instance in the prometheus service is monitored by using a gossip protocol, failover is voted on when a failure or service abnormality of the prometheus main instance is detected, and a new main instance is determined as the target main instance; after the failover is completed, each monitoring index of the monitored service is collected through the target main instance and written into a time-series database for persistence. Failover can thus be performed and a new main instance determined when the main instance's service is abnormal, so the cluster as a whole never goes offline, solving the single-point and data-inconsistency problems; remote storage through the time-series database achieves high availability of the prometheus cluster, guarantees data consistency, and also achieves data persistence. Compared with schemes that depend on third-party components, this scheme is more flexible, highly secure, not easily affected by third-party components, more efficient, and more extensible.
Drawings
FIG. 1 is a schematic diagram of the apparatus structure of a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a prometheus management and control method according to a first embodiment of the present invention;
FIG. 3 is a schematic diagram of the architecture by which a watcher acquires prometheus master-slave instance configuration information in the prometheus management and control method of the present invention;
FIG. 4 is a schematic diagram of the synchronization architecture by which the status of the prometheus main instance and the status of each watcher are synchronized among the watcher instances in the prometheus management and control method of the present invention;
FIG. 5 is a flowchart illustrating a prometheus management and control method according to a second embodiment of the present invention;
FIG. 6 shows the keep-alive heartbeat framework between a watcher instance and each prometheus instance in the prometheus management and control method of the present invention;
FIG. 7 is a flowchart illustrating a prometheus management and control method according to a third embodiment of the present invention;
FIG. 8 is a flowchart illustrating a prometheus management and control method according to a fourth embodiment of the present invention;
FIG. 9 is a schematic diagram of the collaboration flow of a prometheus cluster in the prometheus management and control method of the present invention;
FIG. 10 is a schematic diagram of the detection of prometheus main instance service abnormality or offline status in the prometheus management and control method of the present invention;
FIG. 11 is a flowchart illustrating a prometheus management and control method according to a fifth embodiment of the present invention;
FIG. 12 is a flowchart illustrating a prometheus management and control method according to a sixth embodiment of the present invention;
FIG. 13 is a flowchart of failover in the prometheus management and control method of the present invention;
FIG. 14 is a flowchart illustrating a prometheus management and control method according to a seventh embodiment of the present invention;
FIG. 15 is a functional block diagram of a prometheus management and control apparatus according to a first embodiment of the present invention.
The implementation, functional features and advantages of the present invention will be further described with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The solution of the embodiment of the invention is mainly as follows: acquiring each watcher instance from a distributed observer watcher system, and deploying each watcher instance as a sidecar in each monitoring and alarm framework prometheus service; monitoring each instance in the prometheus service by using a gossip protocol, voting for failover when a failure or service abnormality of the prometheus main instance is detected, and determining a new main instance as the target main instance; and, after the failover is completed, collecting each monitoring index of the monitored service through the target main instance and writing each monitoring index into a time-series database for persistence. Failover can thus be performed and a new main instance determined when the main instance's service is abnormal, so the cluster as a whole never goes offline, solving the single-point and data-inconsistency problems; remote storage through the time-series database achieves high availability of the prometheus cluster, guarantees data consistency, and also achieves data persistence. Compared with schemes depending on third-party components, this scheme is more flexible, highly secure, not easily affected by third-party components, more efficient, and more extensible; the technical problems in the prior art that data among prometheus services is inconsistent and data persistence cannot be guaranteed are thereby solved.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a prometheus management and control device in a hardware operating environment according to an embodiment of the present invention.
As shown in fig. 1, the prometheus management and control device may include: a processor 1001, such as a CPU, a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to implement connection communication among these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard); optionally, the user interface 1003 may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory, such as disk storage. The memory 1005 may alternatively be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the configuration of the apparatus shown in fig. 1 is not intended to be limiting of the apparatus and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
As shown in fig. 1, the memory 1005, which is a storage medium, may include an operating system, a network communication module, a user interface module, and a prometheus management and control program.
The present device calls, through the processor 1001, the prometheus management and control program stored in the memory 1005, which, when executed by the processor, implements the steps of the prometheus management and control method embodiments described above.
According to this scheme, each watcher instance is acquired from a distributed observer watcher system and deployed as a sidecar in each monitoring and alarm framework prometheus service; each instance in the prometheus service is monitored by using a gossip protocol, failover is voted on when a failure or service abnormality of the prometheus main instance is detected, and a new main instance is determined as the target main instance; after the failover is completed, each monitoring index of the monitored service is collected through the target main instance and written into a time-series database for persistence. Failover can thus be performed and a new main instance determined when the main instance's service is abnormal, so the cluster as a whole never goes offline, solving the single-point and data-inconsistency problems; remote storage through the time-series database achieves high availability of the prometheus cluster, guarantees data consistency, and also achieves data persistence. Compared with schemes depending on third-party components, this scheme is more flexible, highly secure, not easily affected by third-party components, more efficient, and more extensible.
Based on the above hardware structure, an embodiment of the prometheus management and control method of the present invention is provided.
Referring to fig. 2, fig. 2 is a schematic flowchart of a prometheus management and control method according to a first embodiment of the present invention.
In a first embodiment, the prometheus management and control method includes the following steps:
and step S10, obtaining the watchdog instances from the distributed observer watchdog system, and deploying the watchdog instances as sidecar in each monitoring alarm framework promemetus service.
It should be noted that the observer watcher system is a preset distributed watcher system in which a plurality of watcher instances exist; each watcher instance can be deployed as a sidecar in each monitoring and alarm framework prometheus service, so as to monitor the prometheus service in real time.
It is understood that the open-source system monitoring and alarm framework prometheus is an open-source monitoring and alarm solution originally from SoundCloud; prometheus stores time-series data, i.e. successive samples are stored along the time dimension within the same time series (same metric name and labels).
In a specific implementation, as shown in fig. 3, fig. 3 is an architecture diagram of a watcher acquiring prometheus master-slave instance configuration information in the prometheus management and control method of the present invention. Referring to fig. 3, master is the prometheus master instance, and slave0 and slave1 are prometheus slave instances. Each watcher sends an instance information request InstanceInfo to the prometheus master instance at a fixed period to acquire the configuration information of the slave instances and the locally stored master-slave monitoring index information. The watcher's own configuration mainly records the master instance information; by sending InstanceInfo to the master instance it obtains all slave instance information, and it can therefore sense when a new slave instance joins the cluster.
Step S20, monitoring each instance in the prometheus service by using a gossip protocol, voting for failover when a failure or service abnormality of the prometheus main instance is detected, and determining a new main instance as the target main instance.
It should be noted that the gossip protocol is a communication protocol, a way of propagating messages. Its basic idea is as follows: a node wants to share some information with the other nodes in the network, so it periodically selects some nodes at random and passes the information to them; the nodes that receive the information then do the same, passing the information on to other randomly selected nodes.
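By way of illustration only (this sketch is not part of the patent text), the following Go code shows the basic gossip round just described: each node periodically picks a few random peers and pushes its current view of the master's status to them. The Node and State types and the fan-out value are assumptions made for the example.

```go
package gossip

import (
	"math/rand"
	"time"
)

// State is the information a node shares with its peers, e.g. the
// observed status of the prometheus master instance. Hypothetical type.
type State struct {
	MasterID string
	Healthy  bool
}

// Node is one gossip participant, e.g. a watcher instance.
type Node struct {
	ID    string
	Peers []*Node
	Inbox chan State
}

// GossipRound periodically picks `fanout` random peers and pushes the
// current state to them; each receiver later does the same, so the
// information spreads epidemically through the cluster.
func (n *Node) GossipRound(current State, fanout int, period time.Duration) {
	ticker := time.NewTicker(period)
	defer ticker.Stop()
	for range ticker.C {
		order := rand.Perm(len(n.Peers))
		if fanout > len(order) {
			fanout = len(order)
		}
		for _, i := range order[:fanout] {
			select {
			case n.Peers[i].Inbox <- current: // non-blocking push
			default: // peer busy; it will hear the news from another peer
			}
		}
	}
}
```

Because every receiver repeats the same push, news of a failed master instance spreads through the watcher cluster in a logarithmic number of rounds.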
In a specific implementation, the prometheus service is responsible for collecting data, storing it locally, synchronizing it to remote storage, answering queries, and providing alarm notifications. The status of the prometheus main instance and the status of each watcher are synchronized among the watcher instances. As shown in fig. 4, fig. 4 is a schematic diagram of the synchronization architecture by which the status information of the prometheus master instance and each watcher's own status are synchronized among the watcher instances in the prometheus management and control method of the present invention. Referring to fig. 4, each watcher (watcher0, watcher1, watcher2) sends the master instance status and its own watcher configuration information to the other watchers at a fixed period, thereby synchronizing the prometheus master instance status information.
It can be understood that, by using the gossip protocol, each instance in the prometheus service can be monitored; when a failure or service abnormality of the prometheus main instance is detected, failover can be performed by voting and a new main instance determined as the target main instance. That is, when the main instance is found to be faulty or abnormal, whether failover is performed is decided by the vote of each instance, and which instance becomes the new main instance is likewise determined by the vote.
In a specific implementation, each watcher periodically sends instance information InstanceInfo requests to the prometheus master instance and the slave instances through the gossip protocol, so as to obtain the configuration of the other instances, thereby monitoring prometheus.
Step S30, after the failover is completed, collecting each monitoring index of the monitored service through the target main instance, and writing each monitoring index into a time-series database for persistence.
It should be understood that after the failover is completed, the various monitoring metrics of the monitored service can be collected by the new master instance, replicated to the other prometheus instances, and persisted by being written to the time-series database influxdb.
Further, after the persistence operation is completed, if the monitoring data needs to be observed, it may be read from the time-series database influxdb through a client for display, or displayed after each monitoring index in the monitoring data has been analyzed; this embodiment does not limit this.
According to this scheme, each watcher instance is acquired from a distributed observer watcher system and deployed as a sidecar in each monitoring and alarm framework prometheus service; each instance in the prometheus service is monitored by using a gossip protocol, failover is voted on when a failure or service abnormality of the prometheus main instance is detected, and a new main instance is determined as the target main instance; after the failover is completed, each monitoring index of the monitored service is collected through the target main instance and written into a time-series database for persistence. Failover can thus be performed and a new main instance determined when the main instance's service is abnormal, so the cluster as a whole never goes offline, solving the single-point and data-inconsistency problems; remote storage through the time-series database achieves high availability of the prometheus cluster, guarantees data consistency, and also achieves data persistence. Compared with schemes depending on third-party components, this scheme is more flexible, highly secure, not easily affected by third-party components, more efficient, and more extensible.
Further, fig. 5 is a flowchart illustrating a prometheus management and control method according to a second embodiment of the present invention. As shown in fig. 5, the second embodiment of the prometheus management and control method of the present invention is proposed on the basis of the first embodiment. In this embodiment, step S10 specifically includes the following steps:
step S11, obtaining each prometheus instance of each monitoring alarm framework prometheus service, and deploying each prometheus instance in a container orchestration engine kubernets cluster.
It should be noted that the container orchestration engine Kubernetes is an open-source system for managing containerized applications across multiple hosts in a cloud platform. After each prometheus instance of each monitoring and alarm framework prometheus service is obtained, each prometheus instance can be deployed in a container orchestration engine Kubernetes cluster, so that management, discovery, and access of each prometheus instance are realized through the built-in load-balancing policy.
Step S12, acquiring each watcher instance from the distributed observer watcher system, and deploying each watcher instance as a sidecar in the data structure pod of each prometheus instance.
It can be understood that the watcher system is a pre-designed distributed watcher system, in which each watcher instance can be deployed as a sidecar in the prometheus service, that is, in the data structure pod of each prometheus instance; a minimal sketch follows below.
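As a minimal sketch only, and assuming the standard k8s.io/api/core/v1 and k8s.io/apimachinery packages from the Kubernetes Go client, a pod carrying a prometheus container plus a watcher sidecar could be described as follows; the image names and ports are illustrative assumptions, not taken from the patent.

```go
package deploy

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// PrometheusPodWithWatcher builds a pod whose first container runs a
// prometheus instance and whose second container is the watcher sidecar;
// both containers share the pod's network namespace, so the watcher can
// probe prometheus on localhost.
func PrometheusPodWithWatcher(name string) *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:   name,
			Labels: map[string]string{"app": "prometheus"},
		},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{
				{
					Name:  "prometheus",
					Image: "prom/prometheus:latest", // assumed image tag
					Ports: []corev1.ContainerPort{{ContainerPort: 9090}},
				},
				{
					Name:  "watcher",
					Image: "example/watcher:latest", // hypothetical sidecar image
					Ports: []corev1.ContainerPort{{ContainerPort: 8080}},
				},
			},
		},
	}
}
```

Such a pod would typically be created through client-go's CoreV1().Pods(namespace).Create call, and in practice via a StatefulSet or Deployment rather than as a bare pod.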
In a specific implementation, as shown in fig. 6, fig. 6 shows the keep-alive heartbeat framework between a watcher instance and each prometheus instance in the prometheus management and control method of the present invention. Referring to fig. 6, master is the prometheus master instance, and slave0 and slave1 are prometheus slave instances. Each watcher (watcher0, watcher1, watcher2) performs heartbeat detection on the master instance, the slave instances, and the other watcher instances to keep every instance alive; that is, each watcher instance sends the master instance status and its own configuration information to the other watcher instances at a fixed period, and at the same fixed period performs one heartbeat detection on the master instance, the slave instances, and the other watcher instances, realizing status monitoring and keep-alive for every instance.
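A minimal Go sketch of this fixed-period heartbeat loop might look as follows; the health-check URL, the timeout, and the miss threshold are assumptions made for the example, not values defined by the patent.

```go
package heartbeat

import (
	"net/http"
	"time"
)

// Peer is any monitored endpoint: the prometheus master instance, a
// slave instance, or another watcher instance.
type Peer struct {
	Name      string
	HealthURL string // hypothetical health endpoint
	missed    int
}

// KeepAlive probes every peer once per fixed period, counts consecutive
// misses, and reports a peer as down after maxMissed misses (which in
// the scheme above would trigger the voting flow).
func KeepAlive(peers []*Peer, period time.Duration, maxMissed int, down chan<- string) {
	client := &http.Client{Timeout: 2 * time.Second}
	ticker := time.NewTicker(period)
	defer ticker.Stop()
	for range ticker.C {
		for _, p := range peers {
			resp, err := client.Get(p.HealthURL)
			healthy := err == nil && resp.StatusCode == http.StatusOK
			if err == nil {
				resp.Body.Close()
			}
			if healthy {
				p.missed = 0
				continue
			}
			p.missed++
			if p.missed == maxMissed {
				down <- p.Name
			}
		}
	}
}
```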
According to this scheme, each prometheus instance of each monitoring and alarm framework prometheus service is obtained and deployed in a container orchestration engine Kubernetes cluster; each watcher instance is acquired from the distributed observer watcher system and deployed as a sidecar in the data structure pod of each prometheus instance. This improves the data stability of the prometheus service and realizes high availability of the prometheus cluster; compared with schemes depending on third-party components, it is more flexible, highly secure, not easily affected by third-party components, more efficient, and more extensible.
Further, fig. 7 is a flowchart illustrating a prometheus management and control method according to a third embodiment of the present invention. As shown in fig. 7, the third embodiment of the prometheus management and control method of the present invention is proposed on the basis of the first embodiment. In this embodiment, step S20 specifically includes the following steps:
step S21, obtaining information of a master instance and a slave instance from a prometheus service by using a gossip protocol, and judging whether the master instance has a fault or is abnormal in service according to the information of the master instance and the slave instance.
It should be noted that each watcher using the gossip protocol constantly and periodically checks whether the master instance and the slave instances of the prometheus service are operating normally; that is, the watcher instance receives, via the gossip protocol, information on whether the prometheus master instance is offline. The master instance and slave instance information is the prometheus master instance information and slave instance information, and whether the master instance has failed or its service is abnormal is judged from this information.
It can be understood that each watcher instance (e.g., watcher0, watcher1, watcher2) may send instance information InstanceInfo requests to the prometheus master instance and slave instances at a fixed period, so as to obtain the configuration information of the master instance and the other slave instances; that is, InstanceInfo is sent to the master instance through the watcher instance, and all slave instance information is obtained. The watcher instance can thus sense when a new slave instance joins the cluster.
It should be understood that the state of the master instance can be determined from the master instance and slave instance information, including whether the master instance is offline or in another abnormal state.
Step S22, generating a master instance status exception command when a failure or service abnormality of the prometheus main instance is detected.
It can be understood that when a failure or service abnormality of the prometheus master instance is detected, a corresponding master instance status exception command can be generated; this command contains information about the current master instance status abnormality as well as the request asking the other watcher instances to vote.
Step S23, sending the master instance status exception command to the other watchers, receiving voting information, performing failover according to the voting information, and determining a new main instance as the target main instance.
It should be understood that every online watcher instance may initiate the failover process. When a watcher instance confirms that the master instance is abnormal, it sends a master instance status exception command to the other watcher instances and asks to set itself as the initiator, the initiator being the one that handles the failover. When another watcher receives this command, it may grant or deny the request to become the initiator; when the majority of watchers agree, the failover operation is performed by the initiating watcher, and the new master instance is determined as the target master instance.
In a specific implementation, when a failure or service abnormality of the prometheus main instance is detected, the main instance abnormality event is synchronized to the other watchers, and the synchronous replication state of the prometheus main instance is stopped. When the prometheus system starts, which prometheus instance serves as the main instance can be determined from the instance priority numbers in prometheus's own configuration file; after the watcher acquires the main instance information of the prometheus main instance, it can write that information into the watcher's configuration. Each watcher sends the master instance status and its own watcher information to the other watchers at a fixed period, and at a fixed period performs heartbeat detection on the master instance, the slave instances, and the other watcher instances. When most watchers consider the prometheus main instance to have failed or its service to be abnormal, the initiator elected by the watchers performs the failover and selects a new main instance, and the original slave instances initiate replication from the new main node. When the failover is completed, the main instance begins reading the configuration information of the monitored service from the configuration server; it begins collecting the various monitoring indexes of the monitored service and replicates them to the other prometheus instances; at the same time, the main instance remotely writes the collected indexes into influxdb for persistence.
According to this scheme, master instance and slave instance information is acquired from the prometheus service by using the gossip protocol, and whether the master instance has failed or its service is abnormal is judged from that information; a master instance status exception command is generated when a failure or service abnormality of the prometheus main instance is detected; the command is sent to the other watchers, voting information is received, failover is performed according to the voting information, and a new main instance is determined as the target main instance. Failover can thus be performed quickly when the main instance's service is abnormal and a new main instance determined, so the cluster as a whole never goes offline, solving the single-point and data-inconsistency problems; the scheme is more flexible, highly secure, not easily affected by third-party components, more efficient, and more extensible.
Further, fig. 8 is a flowchart illustrating a prometheus management and control method according to a fourth embodiment of the present invention. As shown in fig. 8, the fourth embodiment of the prometheus management and control method of the present invention is proposed on the basis of the third embodiment. In this embodiment, step S21 specifically includes the following steps:
step S211, reading a configuration file of prometheus from a prometheus service by using a gossip protocol, acquiring instance identifiers of each instance from the configuration file, and selecting the instance corresponding to the identifier with the minimum number from the instance identifiers as a main instance.
It should be understood that, using the gossip protocol, the configuration file of prometheus can be read from the prometheus service; the configuration file contains an instance identifier for each prometheus instance, such as ins0, ins1, ins2, and so on, and by default the instance with the smallest identifier number is chosen as the main instance. Other criteria may also be used to select the main instance, for example choosing the instance with the largest replication offset; this embodiment does not limit this. Besides the instance identifiers, the configuration file also contains the configuration information of the other instances. A sketch of the default rule follows below.
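For illustration, a Go sketch of the default selection rule (smallest identifier number wins) might look as follows; the "ins" prefix follows the ins0/ins1/ins2 example above, and the parsing is an assumption made for the sketch.

```go
package election

import (
	"strconv"
	"strings"
)

// PickInitialMaster selects, from identifiers such as "ins0", "ins1",
// "ins2" read out of the prometheus configuration file, the instance
// whose identifier carries the smallest number, the default rule in
// this embodiment.
func PickInitialMaster(ids []string) (string, bool) {
	if len(ids) == 0 {
		return "", false
	}
	best := ids[0]
	for _, id := range ids[1:] {
		if num(id) < num(best) {
			best = id
		}
	}
	return best, true
}

// num extracts the numeric suffix, e.g. "ins2" -> 2.
func num(id string) int {
	n, _ := strconv.Atoi(strings.TrimPrefix(id, "ins"))
	return n
}
```

For example, PickInitialMaster([]string{"ins2", "ins0", "ins1"}) would return "ins0".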
Step S212, writing the main instance information of the main instance into the monitoring configuration file of the target watcher instance, the monitoring configuration file being loaded when the target watcher instance starts.
It can be understood that the target watcher instance is the watcher instance currently detecting the main instance. The prometheus main instance information is written into the configuration file of the target watcher, and the target watcher instance loads this configuration file when it starts; the file contains the prometheus main instance information, and a watcher instance priority number sid (0, 1, 2, and so on) serves as the identifier used for indexing.
Step S213, after detecting that the target watcher instance has started normally, connecting the other watcher instances and the slave instances other than the master instance in the prometheus service.
It should be understood that after a watcher instance starts normally, it connects to the other watcher instances and the prometheus instances, which facilitates data exchange among the watcher instances; when a monitored instance develops a problem, the watcher notifies the other watchers of the problematic instance's state over the Hypertext Transfer Protocol (HTTP). The gossip protocol described above is the protocol the watcher uses to monitor the master instance, while HTTP is the transport protocol used among the watchers for prometheus instance status information; a sketch of such a notification follows below.
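By way of illustration, the HTTP notification between watchers might be sketched in Go as follows; the payload fields and the /master-status path are assumptions, since the patent does not define the wire format.

```go
package notify

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// MasterStatus is the payload one watcher sends to the others over HTTP
// when it detects a problem with the prometheus master instance.
type MasterStatus struct {
	MasterID string `json:"master_id"`
	Healthy  bool   `json:"healthy"`
	Reporter string `json:"reporter"` // sid of the reporting watcher
}

// Broadcast POSTs the status to every other watcher instance.
func Broadcast(status MasterStatus, watcherAddrs []string) error {
	body, err := json.Marshal(status)
	if err != nil {
		return err
	}
	for _, addr := range watcherAddrs {
		url := fmt.Sprintf("http://%s/master-status", addr) // hypothetical path
		resp, err := http.Post(url, "application/json", bytes.NewReader(body))
		if err != nil {
			continue // an unreachable watcher is simply skipped
		}
		resp.Body.Close()
	}
	return nil
}
```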
In a specific implementation, as shown in fig. 9, fig. 9 is a schematic diagram of the collaboration flow of a prometheus cluster in the prometheus management and control method of the present invention. Referring to fig. 9: S400, the configuration files of prometheus are read when prometheus starts; each configuration file carries the instance identifiers ins0, ins1, ins2, the smallest identifier number serving as the main instance by default, together with the configuration information of the other instances. S401, the prometheus main instance writes the main instance information into the watcher's configuration file; a watcher instance loads this configuration file when it starts (the prometheus main instance information, with the watcher instance priority number sid = 0, 1, 2, etc. as identifier). S402, the watcher instance starts normally and connects to the other watcher instances and the prometheus instances. S403, the watcher acquires master instance and slave instance information from the prometheus cluster at regular intervals, periodically and dynamically monitoring the state of the prometheus master instance; when the main instance is detected to be faulty or its service abnormal, the election of a new main instance must begin, for which the slave instance information is essential: its configuration information and state become the key to constructing the new main instance. S404, the status of the prometheus main instance and the status of each watcher are synchronized among the watcher instances. S405, the prometheus main instance acquires from the configuration file the monitored object services to be collected, i.e. service (SVC) discovery. S406, the prometheus main instance starts collecting the indexes of the services (SVC1, SVC2) and stores them in the local time-series database, prometheus providing local storage; the advantage of local storage is simple operation and maintenance, while the disadvantage is that massive metrics cannot be persisted and there is a risk of data loss. The time-series database influxdb is used to overcome the single-node failure and storage limitations, serving as prometheus's remote storage system and giving prometheus extensibility. S407, the prometheus master instance replicates and synchronizes the monitoring indexes to the other prometheus slave instances. S408, the prometheus main instance remotely synchronizes the local indexes to the back-end time-series database influxdb cluster. S409, the client obtains the indexes from the time-series database influxdb for display and analysis.
Step S214, the target watcher instance periodically acquires master instance and slave instance information from the cluster corresponding to the prometheus service.
It should be noted that the target watcher instance may periodically send instance information InstanceInfo requests to the prometheus master instance and slave instances in the cluster corresponding to the prometheus service, so as to obtain the master instance and slave instance information, which includes the tags, IPs, ports, and so on of the master and slave instances; a sketch of such a request follows below.
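A minimal Go sketch of this periodic InstanceInfo polling follows; the field set mirrors the tags/IP/port/replication-offset description above, and the /instance-info endpoint is a hypothetical name, not one defined by the patent.

```go
package instanceinfo

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// InstanceInfo carries the per-instance data described above: tags, IP,
// port, and the replication offset that is used later during failover.
type InstanceInfo struct {
	Role       string            `json:"role"` // "master" or "slave"
	Tags       map[string]string `json:"tags"`
	IP         string            `json:"ip"`
	Port       int               `json:"port"`
	ReplOffset int64             `json:"repl_offset"`
}

// Poll periodically requests instance information from the prometheus
// master and publishes each successful result on the out channel.
func Poll(masterAddr string, period time.Duration, out chan<- []InstanceInfo) {
	ticker := time.NewTicker(period)
	defer ticker.Stop()
	for range ticker.C {
		resp, err := http.Get(fmt.Sprintf("http://%s/instance-info", masterAddr))
		if err != nil {
			continue // master unreachable; the heartbeat logic handles this
		}
		var infos []InstanceInfo
		if json.NewDecoder(resp.Body).Decode(&infos) == nil {
			out <- infos
		}
		resp.Body.Close()
	}
}
```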
Step S215, determining the current state of the main instance according to the master instance and slave instance information, and judging, according to the current state, whether the main instance has failed or its service is abnormal.
It can be understood that the master instance and slave instance information describes the state of the prometheus master instance. The watcher instance periodically acquires this information from the prometheus cluster, realizing periodic dynamic monitoring of the prometheus master instance's state; from the current state of the master instance it can be judged whether that state is abnormal, and when the state shows a service abnormality, a master instance fault, or an offline instance, the master instance is judged to have failed or to be abnormal in service.
In a specific implementation, as shown in fig. 10, fig. 10 is a schematic diagram of the detection of prometheus main instance service abnormality or offline status in the prometheus management and control method of the present invention. Referring to fig. 10: S500, the status of the prometheus master instance and the status of each watcher are synchronized among the watcher instances; S501, the watcher instance periodically and dynamically monitors the status of the prometheus main instance; S502, the watcher0 instance discovers the status exception of the prometheus main instance and synchronizes the main instance exception event to the other watcher instances; S503, the synchronous replication state of the prometheus main instance is stopped.
In this embodiment, through the above scheme, the gossip protocol is used to read the prometheus configuration file from the prometheus service, the instance identifiers of each instance are acquired from the configuration file, and the instance whose identifier carries the smallest number is selected as the main instance; the main instance information is written into the monitoring configuration file of the target watcher instance, which loads it on startup; after the target watcher instance is detected to have started normally, the other watcher instances and the slave instances other than the master instance in the prometheus service are connected; the target watcher instance periodically acquires master instance and slave instance information from the cluster corresponding to the prometheus service; and the current state of the main instance is determined from that information, from which it is judged whether the main instance has failed or its service is abnormal. This allows main instance service abnormalities to be determined quickly and accurately, shortens the fault-judgment time, raises the speed of fault judgment, and thus raises the speed and efficiency of failover when the main instance's service becomes abnormal.
Further, fig. 11 is a flowchart illustrating a prometheus management and control method according to a fifth embodiment of the present invention. As shown in fig. 11, the fifth embodiment of the prometheus management and control method of the present invention is proposed on the basis of the third embodiment. In this embodiment, step S22 specifically includes the following steps:
step S221, when the failure or the service abnormality of the main instance of the prometheus is monitored, taking the current watchdog instance corresponding to the main instance as an initiator.
It should be noted that, when a failure or service abnormality of the prometheus main instance is detected, the current watcher instance may be taken as the initiator by default in order to notify the other watcher instances; the status of the prometheus main instance and the status of each watcher are synchronized among the watcher instances, and the watcher instance periodically and dynamically monitors the status of the prometheus master instance.
Step S222, generating a master instance status exception command asking the other watcher instances whether they agree to the current watcher instance acting as the initiator.
It can be understood that, when a failure or service abnormality of the prometheus master instance is detected, a master instance status exception command may be generated to query the other watchers; the command contains information about the master instance status abnormality together with the request asking whether they approve the current watcher instance as the initiator.
In a specific implementation, after the current watcher instance discovers that the prometheus master instance is offline, it notifies the other watcher instances of the master instance status and asks them whether they agree; if most of the watcher instances indicate agreement, the selection of the new master instance is initiated by the current watcher instance; a sketch of this majority check follows below.
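As an illustration of the majority check only, the following Go sketch counts agreement among the watchers; the transport is abstracted into a callback so the sketch stays self-contained, and the simple more-than-half rule stands in for the configurable "preset ratio" of the sixth embodiment.

```go
package vote

// AskInitiator sends the master-status-exception command to every other
// watcher and counts how many agree that the asking watcher should act
// as the failover initiator. In the scheme above the ask callback would
// be an HTTP request to the peer.
func AskInitiator(self string, others []string, ask func(peer, candidate string) bool) bool {
	agree := 1 // the asking watcher votes for itself
	for _, peer := range others {
		if ask(peer, self) {
			agree++
		}
	}
	// Simple majority over all watchers stands in for the preset ratio.
	total := len(others) + 1
	return agree*2 > total
}
```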
According to this scheme, when a failure or service abnormality of the prometheus main instance is detected, the current watcher instance corresponding to the main instance is taken as the initiator, and a master instance status exception command is generated asking the other watcher instances whether they approve the current watcher instance as initiator. Through this voting mechanism, failover can be performed and a new main instance determined when the main instance's service is abnormal, so the cluster as a whole never goes offline, solving the single-point and data-inconsistency problems. Compared with schemes depending on third-party components, this scheme is more flexible, highly secure, not easily affected by third-party components, more efficient, and more extensible.
Further, fig. 12 is a flowchart illustrating a prometheus management and control method according to a sixth embodiment of the present invention. As shown in fig. 12, the sixth embodiment of the prometheus management and control method of the present invention is proposed on the basis of the third embodiment. In this embodiment, step S23 specifically includes the following steps:
and step S231, sending the main instance state abnormal command to other watchers, and receiving voting information fed back by other watchers.
It should be noted that every online watcher may initiate the failover process: when a watcher instance confirms that the master instance is abnormal, it may send a master instance status exception command to the other watchers and ask to set itself as the initiator, the initiator then handling the failover. The voting information is the feedback from the other watcher instances either agreeing to or rejecting the current watcher acting as the initiator.
Step S232, when the voting information shows that the agreement ratio is greater than the preset ratio, excluding the abnormal prometheus instance, with the current watcher instance acting as the initiator, to obtain the remaining prometheus instance cluster.
It can be understood that the preset ratio is a preconfigured voting threshold. After receiving the master instance status exception command, the other watchers may agree or refuse to make the current watcher instance the initiator; when the majority of watcher instances agree, the failover operation is performed by the current, initiating watcher instance. The first step of the failover operation is to exclude the previously abnormal prometheus main instance and then determine the remaining prometheus instance cluster.
Step S233, selecting the instance with the smallest instance identifier from the remaining prometheus instance cluster as the new main instance, or selecting the instance with the largest replication offset from the remaining prometheus instance cluster as the new main instance, and taking the new main instance as the target main instance.
It should be understood that the instance with the smallest instance identifier may be selected from the remaining prometheus instance cluster as the new master instance, or the prometheus instance with the largest replication offset may be selected, the new master instance then being taken as the target master instance: the larger the replication offset, the more complete the replicated data. After the master instance is selected, the status of each prometheus instance is updated and data synchronization begins; a sketch of this election follows below.
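For illustration, a Go sketch of the new-master election over the remaining instances follows. The description allows either the smallest identifier or the largest replication offset as the criterion; combining them, with the offset preferred and the identifier as tie-break, is one possible reading, not the patent's mandated rule.

```go
package failover

// Candidate is a surviving prometheus instance after the abnormal
// master has been excluded.
type Candidate struct {
	Sid        int   // instance priority number
	ReplOffset int64 // replication offset: larger means more complete data
}

// ElectMaster prefers the largest replication offset (most complete
// copy) and falls back to the smallest identifier when offsets tie.
func ElectMaster(remaining []Candidate) (Candidate, bool) {
	if len(remaining) == 0 {
		return Candidate{}, false
	}
	best := remaining[0]
	for _, c := range remaining[1:] {
		if c.ReplOffset > best.ReplOffset ||
			(c.ReplOffset == best.ReplOffset && c.Sid < best.Sid) {
			best = c
		}
	}
	return best, true
}
```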
In a specific implementation, as shown in fig. 13, fig. 13 is a flowchart of failover in the prometheus management and control method of the present invention. Referring to fig. 13: S600, the status of the prometheus master instance and the status of each watcher are synchronized among the watchers. S601, the watcher instance periodically and dynamically monitors the status of the prometheus main instance (S601 toward prometheus1 and prometheus2 is omitted in the figure). S602, watcher0 discovers that the prometheus master instance has gone offline and notifies the other watcher instances of the master instance status; the other instances feed back whether they agree, and if most instances indicate agreement, this round is initiated by watcher0. S603, watcher0 begins electing a master: it excludes the abnormal prometheus instance, selects the instance with the smaller identifier or the instance with the large replication offset as the master instance, and updates the status of each instance. S604, after the watcher selects prometheus2 as the main instance, data synchronization begins; this is the first data synchronization operation after failover. S605, the prometheus2 instance acquires the configuration information of the monitored services, i.e. service discovery (SVC discovery). S606, prometheus2 starts collecting the indexes of the monitored services (SVC1, SVC2) and stores them locally. S604, the prometheus2 instance starts replicating the monitoring data; this is the normal-period operation of copying data from the master instance to the slave instances. S607, the prometheus2 instance starts remote synchronization of the data to the back-end influxdb. S608, the client obtains the indexes from the time-series database influxdb for display and analysis.
According to this scheme, the master instance status exception command is sent to the other watchers, and the voting information fed back by them is received; when the voting information shows that the agreement ratio is greater than the preset ratio, the abnormal prometheus instance is excluded, with the current watcher acting as initiator, to obtain the remaining prometheus instance cluster; from that cluster the instance with the smallest instance identifier, or the instance with the largest replication offset, is selected as the new main instance and taken as the target main instance. Failover can thus be performed and a new main instance determined when the main instance's service is abnormal; since the cluster as a whole never goes offline, the single-point and data-inconsistency problems are solved.
Further, fig. 14 is a schematic flowchart of a prometheus management and control method according to a seventh embodiment of the present invention. As shown in fig. 14, the seventh embodiment of the prometheus management and control method of the present invention is proposed based on the first embodiment. In this embodiment, step S30 specifically includes the following steps:
Step S31, after the failover is completed, reading the configuration information of the monitoring service from the configuration server through the target main instance.
It should be noted that after the failover is completed, synchronous data replication may be started through the target main instance, that is, the state of the prometheus main instance and the states of the watchers are synchronized among the watchers, and the configuration information of the monitoring service is read from the configuration server through the target main instance; from this configuration the prometheus main instance learns which monitored object services it needs to collect from.
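As a minimal sketch of step S31, assuming the configuration server exposes the monitoring-service configuration as JSON over HTTP (the patent does not fix the protocol), the read might look like the following; the endpoint path, URL, and field names are hypothetical.

```go
package config

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// MonitoredService is one object service whose indexes are to be collected.
type MonitoredService struct {
	Name    string   `json:"name"`    // e.g. "SVC1"
	Targets []string `json:"targets"` // scrape endpoints of the service
}

// FetchMonitoringConfig is called by the target main instance after failover
// to read the monitoring-service configuration from the configuration server.
func FetchMonitoringConfig(configServerURL string) ([]MonitoredService, error) {
	resp, err := http.Get(configServerURL + "/monitoring/services")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("config server returned %s", resp.Status)
	}
	var services []MonitoredService
	if err := json.NewDecoder(resp.Body).Decode(&services); err != nil {
		return nil, err
	}
	return services, nil
}
```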
Step S32, collecting each monitoring index of the monitoring service according to the configuration information, and replicating and synchronizing each monitoring index to the other prometheus instances.
It can be understood that collecting each monitoring index of the monitoring service according to the configuration information means that the prometheus main instance collects the indexes; the indexes are generally stored locally first, and are then replicated and synchronized to the other prometheus slave instances.
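Continuing under the same assumptions, a sketch of step S32 could scrape the conventional prometheus /metrics path on each target, keep a local copy, and then push the payload to each slave instance; the /replicate endpoint is a hypothetical helper introduced for this example, not an interface defined by the patent.

```go
package collect

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

// CollectAndReplicate scrapes one monitored target, stores the payload
// locally via store, then copies it to every slave instance.
func CollectAndReplicate(target string, slaves []string, store func([]byte) error) error {
	resp, err := http.Get(target + "/metrics") // conventional prometheus scrape path
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	payload, err := io.ReadAll(resp.Body)
	if err != nil {
		return err
	}
	if err := store(payload); err != nil { // keep the local copy first
		return err
	}
	for _, slave := range slaves {
		// hypothetical replication endpoint assumed on each slave instance
		r, err := http.Post("http://"+slave+"/replicate", "text/plain", bytes.NewReader(payload))
		if err != nil {
			return fmt.Errorf("replicate to %s: %w", slave, err)
		}
		r.Body.Close()
	}
	return nil
}
```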
Step S33, writing each monitoring index into the time-series database influxdb for persistence.
It should be understood that, through the target main instance, remote synchronization of data to the back-end time-series database may be started, that is, each monitoring index is written to the time-series database influxdb for persistence; generally, the local indexes are remotely synchronized to the back-end influxdb cluster to implement persistence, and the client may then obtain the indexes from influxdb for display and analysis.
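A minimal sketch of this persistence step, assuming an InfluxDB 1.x back end and its HTTP line-protocol /write endpoint, is given below; the database name, measurement, and tag are illustrative choices, not values specified by the patent.

```go
package persist

import (
	"bytes"
	"fmt"
	"net/http"
	"net/url"
	"time"
)

// WriteIndex persists one monitoring index into influxdb using line protocol,
// producing a point such as `svc_cpu_usage,svc=SVC1 value=0.42 <timestamp>`.
func WriteIndex(influxURL, db, measurement, svc string, value float64) error {
	line := fmt.Sprintf("%s,svc=%s value=%g %d",
		measurement, svc, value, time.Now().UnixNano())
	endpoint := fmt.Sprintf("%s/write?db=%s", influxURL, url.QueryEscape(db))
	resp, err := http.Post(endpoint, "text/plain", bytes.NewBufferString(line))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	// InfluxDB 1.x answers a successful write with 204 No Content.
	if resp.StatusCode != http.StatusNoContent {
		return fmt.Errorf("influxdb write failed: %s", resp.Status)
	}
	return nil
}
```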
According to the scheme, after the failover is completed, the configuration information of the monitoring service is read from the configuration server through the target main instance; each monitoring index of the monitoring service is collected according to the configuration information, and each monitoring index is replicated and synchronized to the other prometheus instances; each monitoring index is written into the time-series database influxdb for persistence. In this way, remote storage is realized through the time-series database, high availability of the prometheus cluster is achieved, data consistency is guaranteed, and data persistence is realized.
Correspondingly, the invention further provides a prometheus management and control device.
Referring to fig. 15, fig. 15 is a functional block diagram of a prometheus management and control apparatus according to a first embodiment of the present invention.
In a first embodiment of the prometheus management and control apparatus of the present invention, the prometheus management and control apparatus includes:
The deployment module 10 is used for acquiring each watcher instance from the distributed observer watcher system, and deploying each watcher instance as a sidecar in each monitoring alarm framework prometheus service.
The anomaly monitoring module 20 is configured to monitor each instance in the prometheus service using the gossip protocol, vote for failover when a failure or service anomaly of the prometheus main instance is detected, and determine a new main instance as the target main instance.
The persistence module 30 is configured to collect, through the target main instance, each monitoring index of the monitoring service after the failover is completed, and write each monitoring index into the time-series database for persistence.
The steps implemented by each functional module of the prometheus management and control device may refer to the embodiments of the prometheus management and control method of the present invention, and are not described herein again.
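For orientation only, a hypothetical Go rendering of the three functional modules is sketched below; the interface and method names simply mirror the module descriptions above and are not an API defined by the patent.

```go
package device

// DeploymentModule deploys watcher instances as sidecars in each prometheus service.
type DeploymentModule interface {
	DeployWatcherSidecars() error
}

// AnomalyMonitoringModule watches the instances over gossip and runs failover voting.
type AnomalyMonitoringModule interface {
	MonitorInstances() error
	VoteFailover() (newMainInstance string, err error)
}

// PersistenceModule collects monitoring indexes via the target main instance
// and writes them into the time-series database.
type PersistenceModule interface {
	CollectAndPersist() error
}
```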
In addition, an embodiment of the present invention further provides a storage medium, where a prometheus management program is stored on the storage medium, and when executed by a processor, the prometheus management program implements the steps of the prometheus management and control method embodiments described above.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (9)

1. A prometheus management and control method, comprising:
acquiring each watcher instance from the distributed observer watcher system, and deploying each watcher instance as a sidecar in each monitoring alarm framework prometheus service;
monitoring each instance in the prometheus service using the gossip protocol, voting for failover when a failure or service abnormality of the main instance of prometheus is monitored, and determining a new main instance as the target main instance;
and after the failover is completed, collecting each monitoring index of the monitoring service through the target main instance, and writing each monitoring index into a time-series database for persistence.
2. The prometheus management and control method of claim 1, wherein the acquiring of each watcher instance from the distributed observer watcher system and the deploying of each watcher instance as a sidecar in each monitoring alarm framework prometheus service comprises:
acquiring each prometheus instance of each monitoring alarm framework prometheus service, and deploying each prometheus instance in a container orchestration engine Kubernetes cluster;
and acquiring each watcher instance from the distributed observer watcher system, and deploying each watcher instance as a sidecar in the pod of each prometheus instance.
3. The prometheus management and control method of claim 1, wherein the monitoring of each instance in the prometheus service using the gossip protocol, the voting for failover when a failure or service abnormality of the main instance of prometheus is monitored, and the determining of a new main instance as the target main instance comprises:
reading the configuration file of prometheus from the prometheus service using the gossip protocol, acquiring the instance identifiers of all instances from the configuration file, and selecting the instance corresponding to the smallest identifier as the main instance;
writing the main instance information of the main instance into a monitoring configuration file of a target watcher instance, and loading the monitoring configuration file when the target watcher instance is started;
after detecting that the target watcher instance has started normally, connecting the other watcher instances and the slave instances other than the main instance in the prometheus service;
the target watcher instance periodically acquiring main instance information and slave instance information from the cluster corresponding to the prometheus service;
determining the current state of the main instance according to the main instance information and the slave instance information, and judging whether the main instance has a fault or a service abnormality according to the current state;
when a failure or service abnormality of the main instance of prometheus is monitored, generating a main instance state abnormality command;
and sending the main instance state abnormality command to the other watchers, receiving voting information, performing failover according to the voting information, and determining a new main instance as the target main instance, wherein the other watchers are the watcher instances other than the target watcher instance.
4. The prometheus management and control method of claim 3, wherein the generating of a main instance state abnormality command when a failure or service abnormality of the main instance of prometheus is monitored comprises:
when a failure or service abnormality of the main instance of prometheus is monitored, taking the current watcher instance corresponding to the main instance as the initiator;
and generating a main instance state abnormality command asking the other watcher instances whether they agree to the current watcher instance acting as the initiator.
5. The prometheus management and control method of claim 3, wherein the sending of the main instance state abnormality command to the other watchers, receiving voting information, performing failover according to the voting information, and determining a new main instance as the target main instance comprises:
sending the main instance state abnormality command to the other watchers, and receiving the voting information fed back by the other watchers;
when the voting information shows that the agreement proportion is larger than the preset proportion, eliminating the abnormal prometheus instances with the current watcher instance as the initiator, to obtain the remaining prometheus instance cluster;
and selecting the instance with the smallest instance identifier from the remaining prometheus instance cluster as the new main instance, or selecting the instance with the largest replication offset from the remaining prometheus instance cluster as the new main instance, and taking the new main instance as the target main instance.
6. The prometheus management and control method of any one of claims 1-5, wherein the collecting, through the target main instance, of each monitoring index of the monitoring service after the failover is completed, and the writing of each monitoring index into the time-series database for persistence, comprises:
after the failover is completed, reading the configuration information of the monitoring service from a configuration server through the target main instance;
collecting each monitoring index of the monitoring service according to the configuration information, and replicating and synchronizing each monitoring index to the other prometheus instances;
and writing each monitoring index into the time-series database influxdb for persistence.
7. A prometheus management and control device, comprising:
the deployment module is used for acquiring each watcher instance from the distributed observer watcher system and deploying each watcher instance as a sidecar in each monitoring alarm framework prometheus service;
the anomaly monitoring module is used for monitoring each instance in the prometheus service using the gossip protocol, voting for failover when a failure or service anomaly of the main instance of prometheus is monitored, and determining a new main instance as the target main instance;
and the persistence module is used for collecting each monitoring index of the monitoring service through the target main instance after the failover is completed, and writing each monitoring index into the time-series database for persistence.
8. A prometheus management and control device, comprising: a memory, a processor, and a prometheus management program stored on the memory and executable on the processor, the prometheus management program configured to implement the steps of the prometheus management and control method of any one of claims 1 to 6.
9. A storage medium having stored thereon a prometheus management program which, when executed by a processor, implements the steps of the prometheus management and control method of any one of claims 1 to 6.
CN202110171945.XA 2021-02-04 2021-02-04 prometheus management and control method, device, equipment and storage medium Active CN112994935B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110171945.XA CN112994935B (en) 2021-02-04 2021-02-04 prometheus management and control method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110171945.XA CN112994935B (en) 2021-02-04 2021-02-04 prometheus management and control method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112994935A CN112994935A (en) 2021-06-18
CN112994935B true CN112994935B (en) 2022-06-17

Family

ID=76347497

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110171945.XA Active CN112994935B (en) 2021-02-04 2021-02-04 prometheus management and control method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112994935B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113641552B (en) * 2021-07-22 2023-12-05 深圳软通动力信息技术有限公司 Monitoring data acquisition lateral expansion method, system, electronic equipment and storage medium
CN114039836A (en) * 2021-11-05 2022-02-11 光大科技有限公司 Fault processing method and device for Exporter collector
CN114095884A (en) * 2021-11-10 2022-02-25 中国建设银行股份有限公司 Short message processing method and device, electronic equipment and computer readable medium
CN114584462A (en) * 2021-12-27 2022-06-03 天翼云科技有限公司 Network service processing method and device
CN115904879B (en) * 2023-01-06 2023-06-06 天津卓朗昆仑云软件技术有限公司 Example distribution system, method and equipment for Prometheus cluster

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111666189A (en) * 2020-06-12 2020-09-15 中信银行股份有限公司 Method and system for declaratively visually configuring Prometheus monitoring alarm

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11579941B2 (en) * 2019-05-05 2023-02-14 Mastercard International Incorporated Control cluster for multi-cluster container environments
US20200379892A1 (en) * 2019-06-03 2020-12-03 Lightbend, Inc. Automated determination of operating parameter configurations for applications
CN110798375B (en) * 2019-09-29 2021-10-01 烽火通信科技股份有限公司 Monitoring method, system and terminal equipment for enhancing high availability of container cluster
CN111176783A (en) * 2019-11-20 2020-05-19 航天信息股份有限公司 High-availability method and device for container treatment platform and electronic equipment
CN111209011A (en) * 2019-12-31 2020-05-29 烽火通信科技股份有限公司 Cross-platform container cloud automatic deployment system
CN112015753B (en) * 2020-08-31 2023-10-31 北京易捷思达科技发展有限公司 Monitoring system and method suitable for containerized deployment of open source cloud platform
CN112084098A (en) * 2020-10-21 2020-12-15 中国银行股份有限公司 Resource monitoring system and working method
CN112256401B (en) * 2020-10-30 2022-03-15 浪潮云信息技术股份公司 Prometheus high-availability system based on Kubernetes environment and implementation method

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111666189A (en) * 2020-06-12 2020-09-15 中信银行股份有限公司 Method and system for declaratively visually configuring Prometheus monitoring alarm

Also Published As

Publication number Publication date
CN112994935A (en) 2021-06-18

Similar Documents

Publication Publication Date Title
CN112994935B (en) prometheus management and control method, device, equipment and storage medium
CN106790595B (en) Docker container active load balancing device and method
CN106331098B (en) Server cluster system
US10547693B2 (en) Security device capability discovery and device selection
US8122111B2 (en) System and method for server configuration control and management
CN107800565B (en) Inspection method, inspection device, inspection system, computer equipment and storage medium
US20090063650A1 (en) Managing Collections of Appliances
WO2020207371A1 (en) Data processing system and method, apparatus, and electronic device
US20150222765A9 (en) Client device state collection and network-based processing solution
US11956335B1 (en) Automated mapping of multi-tier applications in a distributed system
WO2016183967A1 (en) Failure alarm method and apparatus for key component, and big data management system
US10516734B2 (en) Computer servers for datacenter management
US20110161724A1 (en) Data management apparatus, monitoring apparatus, replica apparatus, cluster system, control method and computer-readable medium
CN112328448A (en) Zookeeper-based monitoring method, monitoring device, equipment and storage medium
CN112333249A (en) Business service system and method
CN117608825A (en) Resource management method based on multi-cloud management platform and related equipment
CN110290163A (en) A kind of data processing method and device
JP5176231B2 (en) Computer system, computer control method, and computer control program
EP4155962B1 (en) A data management system and method
CN112685486B (en) Data management method and device for database cluster, electronic equipment and storage medium
CN109684158A (en) Method for monitoring state, device, equipment and the storage medium of distributed coordination system
TWI748653B (en) System and method for determining operation status of device by updating execution status
JP2015142368A (en) management apparatus and management method
CN112948348B (en) Operation and maintenance control method and device, electronic equipment and storage medium
JP2013037544A (en) Communication device, communication system, and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant