CN110855494B - Method for realizing high availability of agent based on distributed monitoring system

Info

Publication number: CN110855494B
Application number: CN201911129032.0A
Authority: CN
Inventors: 程永新; 宋辉
Original assignee: Shanghai New Torch Network Information Technology Ltd By Share Ltd
Current assignee: Shanghai New Torch Network Information Technology Ltd By Share Ltd
Priority date: 2019-11-18
Filing date: 2019-11-18
Publication date: 2022-10-04
Anticipated expiration: 2039-11-18
Also published as: CN110855494A

Abstract

The invention discloses a method for realizing high availability of an agent based on a distributed monitoring system, which comprises the following steps: S1: deploying a standby server in the distributed monitoring system; S2: the central server periodically checks all components of the distributed monitoring system for abnormal states; S3: when a component serving as a main server is detected to be abnormal, the standby server is started and replaces the abnormal component as the main server; S4: when the abnormal component returns to an available state, the standby server submits its synchronized data to the recovered component, so that the component self-heals. The invention provides a standby server and switches between the main server and the standby server through state checks and context data synchronization; the system continues to operate stably and reliably when a single point of failure occurs in the distributed monitoring system; no manual intervention is needed, which improves operation and maintenance efficiency; and high availability of the entire system is achieved with a single standby server or a small number of them.

Description

Method for realizing high availability of agent based on distributed monitoring system

Technical Field

The invention relates to a high availability method of a monitoring system, in particular to a method for realizing high availability of an agent based on a distributed monitoring system.

Background

Existing distributed monitoring systems are generally built to achieve load balancing and to solve the problem of network isolation. In a traditional monitoring system, the central server must carry out a whole chain of related logic: high-concurrency data acquisition, data processing, and data access. As the monitoring scale grows, the central server reaches a performance bottleneck, and expanding its capacity becomes less and less cost-effective.

By deploying a distributed monitoring system, the logic links of data acquisition, data processing, and data access are separated: the central server is responsible for data processing, the data server for data access, and the proxy servers for data acquisition. This structure relieves monitoring pressure cost-effectively and makes capacity expansion far more convenient: whenever a performance bottleneck appears, the component responsible can be located and expanded without affecting the other components.

Another advantage of the distributed monitoring system is that it addresses network isolation. For network security, modern network planning typically isolates the service network from the management network. If a monitoring system deployed in the management network needs to access IT network elements in the service network, special policies and routes usually have to be opened. In a large network, provisioning such ad hoc access takes an enormous amount of work and becomes increasingly difficult to maintain as the network changes, and the large number of policy and routing entries can also degrade network device performance. Deploying a proxy monitoring server in the service network, setting it as a trusted host, and giving it an exit to the management network makes the configuration much simpler and easier to maintain.

The prior technical scheme has the following problem: the components of the distributed monitoring system do not implement high availability, so the failure of a single machine seriously affects monitoring coverage.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a method for realizing high availability of the agent based on a distributed monitoring system, and the high availability is realized by setting a standby server.

The technical scheme adopted by the invention to solve this technical problem is a method for realizing high availability of the agent based on a distributed monitoring system, which comprises the following steps: S1: deploying a standby server in a distributed monitoring system, wherein the distributed monitoring system comprises a central server, a data server and a plurality of proxy servers, and the central server initially designates the data server and the plurality of proxy servers as main servers; S2: the central server periodically checks the data server and the plurality of proxy servers for abnormal states; S3: when any data server or proxy server component is detected to be abnormal, the data of the abnormal component is stored, the standby server is started, and the standby server replaces the abnormal component as the main server; S4: when the abnormal component returns to an available state, the standby server submits its synchronized data to the recovered component, so that the component self-heals, and the standby server returns to the standby state.

Further, the central server is connected to the data server and the proxy server to perform state detection in real time; the standby server is connected to the data server and the proxy server.

Further, the central server periodically detects the states of the components of the distributed monitoring system through a system timer.

Further, step S2 specifically includes: deploying a detection program on the central server, which judges whether a proxy server is working normally by means of an IP reachability check and a connectivity check of the TCP service port: an ICMP operation checks whether the network path from the central server to the proxy server is clear; a socket operation checks whether port 10051 on the proxy server can be reached from the central server; if either of the two checks fails, the proxy server is confirmed to be in an abnormal state.

Further, step S3 specifically includes: S21: the central server stores the environment data of the abnormal proxy server at the time of switching; S22: the central server synchronizes the environment data to the standby server and outputs a switching log; S23: the change-over switch in the central server is turned on, the standby proxy server is switched to main server so that it takes over all service processing of the abnormal proxy server, and the service data generated by the standby server is marked.

Further, the environment data includes operation configuration parameters required by current service start, low-frequency change data generated during operation, and memory snapshots of key service data structures.
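The three kinds of environment data listed above can be pictured as one serializable record. The following is a minimal sketch only: the class name `EnvironmentData`, its field names, and the choice of JSON as the transfer format are assumptions made for illustration and are not specified by the invention.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict


@dataclass
class EnvironmentData:
    """Hypothetical container for the environment data saved at switch time."""

    proxy_id: str        # which proxy server the data belongs to (assumed field)
    captured_at: float   # capture time as a Unix timestamp (assumed field)
    # Operating configuration parameters required to start the current service.
    run_config: Dict[str, Any] = field(default_factory=dict)
    # Low-frequency change data generated during operation.
    low_frequency_data: Dict[str, Any] = field(default_factory=dict)
    # Memory snapshot of key service data structures.
    memory_snapshot: Dict[str, Any] = field(default_factory=dict)

    def to_json(self) -> str:
        """Serialize the record so it can be stored or sent to the standby."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, payload: str) -> "EnvironmentData":
        """Rebuild the record on the receiving side."""
        return cls(**json.loads(payload))
```

Keeping the three categories as separate fields makes it straightforward to use only `run_config` when the recovered proxy server is restarted.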

Further, step S4 specifically includes: S31: the central server confirms, through abnormal state monitoring, that the previously abnormal proxy server has returned to an available state; S32: the environment data reported and stored when switching to the standby server and the marked data generated while the standby server was running are merged according to time; S33: the central server stops the standby server and synchronizes the merged data to the recovered proxy server; S34: the proxy server is started with the operating configuration parameters required to start the current service, taken from the environment data, and is switched to the main server state.

Compared with the prior art, the invention has the following beneficial effects: the method provides a standby server and switches between the main server and the standby server through state checks and context data synchronization, thereby achieving high availability of the distributed monitoring system; the system continues to operate stably and reliably when a single point of failure occurs in the distributed monitoring system; the automatic high-availability scheme needs no manual intervention, which greatly reduces the operation and maintenance workload and improves operation and maintenance efficiency; and the implementation is cost-effective, since high availability of the whole system can be achieved with a single standby server or a small number of them.

Drawings

Fig. 1 is a flowchart of a method for implementing high availability of agents based on a distributed monitoring system according to an embodiment of the present invention;

Fig. 2 is an architecture diagram of a distributed monitoring system according to an embodiment of the present invention.

Detailed Description

The invention is further described below with reference to the figures and examples.

Fig. 1 is a flowchart of a method for implementing high availability of an agent based on a distributed monitoring system in an embodiment of the present invention.

Referring to fig. 1, a method for implementing high availability of an agent based on a distributed monitoring system according to an embodiment of the present invention includes the following steps:

S1: deploying a standby server in a distributed monitoring system, wherein the distributed monitoring system comprises a central server, a data server and a plurality of proxy servers, and the central server initially designates the data server and the plurality of proxy servers as main servers;

S2: the central server periodically checks the data server and the plurality of proxy servers for abnormal states;

S3: when any data server or proxy server component is detected to be abnormal, the data of the abnormal component is stored, the standby server is started, and the standby server replaces the abnormal component as the main server;

S4: when the abnormal component returns to an available state, the standby server submits its synchronized data to the recovered component, so that the component self-heals, and the standby server returns to the standby state.
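The four steps can be read as role transitions driven by the result of each check. The sketch below is one possible reading, not the patented implementation: the names `Role`, `Cluster` and `on_check` are hypothetical, and it assumes a single standby server that covers at most one abnormal component at a time.

```python
from enum import Enum


class Role(Enum):
    MAIN = "main"
    STANDBY = "standby"
    ABNORMAL = "abnormal"


class Cluster:
    """Hypothetical role table kept by the central server."""

    def __init__(self, components, standby):
        # S1: every listed component starts as a main server and one
        # standby server is deployed next to them.
        self.roles = {name: Role.MAIN for name in components}
        self.standby = standby
        self.roles[standby] = Role.STANDBY
        self.replacing = None  # component currently covered by the standby

    def on_check(self, name, healthy):
        """Apply S3 or S4 to the result of one S2 check."""
        if not healthy and self.roles[name] is Role.MAIN and self.replacing is None:
            # S3: the standby takes over as main server.
            self.roles[name] = Role.ABNORMAL
            self.roles[self.standby] = Role.MAIN
            self.replacing = name
            return "failover"
        if healthy and self.roles[name] is Role.ABNORMAL and self.replacing == name:
            # S4: the recovered component resumes, the standby goes back.
            self.roles[name] = Role.MAIN
            self.roles[self.standby] = Role.STANDBY
            self.replacing = None
            return "failback"
        return "no-op"
```

In this simplified form a second failure that occurs while the standby is already in use is reported as `"no-op"`; a deployment with several standby servers would need a pool instead of the single `standby` field.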

Referring to Fig. 2, the distributed monitoring system in this embodiment of the present invention includes a central server, a standby server, and, serving as main servers, a data server and a plurality of proxy servers. The central server is connected to the data server and the proxy servers to perform state detection in real time; the standby server is also connected to the data server and the proxy servers. The central server periodically detects the states of all the components of the distributed monitoring system through a system timer.
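Timer-driven detection of this kind can be sketched with a background thread that wakes at a fixed interval. This is an illustrative sketch only: `PeriodicMonitor`, its parameters, and the callback signatures are hypothetical, and the actual health check is passed in as a function.

```python
import threading


class PeriodicMonitor:
    """Hypothetical timer loop run by the central server."""

    def __init__(self, interval_s, targets, check, on_result):
        self.interval_s = interval_s   # seconds between detection rounds
        self.targets = list(targets)   # component names to check
        self.check = check             # check(name) -> True if healthy
        self.on_result = on_result     # on_result(name, healthy)
        self._stop = threading.Event()
        self._thread = None

    def tick(self):
        """Run one detection round over every component."""
        for target in self.targets:
            self.on_result(target, self.check(target))

    def _loop(self):
        # Event.wait returns True once stop() has been called, which ends
        # the loop; otherwise it times out and a new round is run.
        while not self._stop.wait(self.interval_s):
            self.tick()

    def start(self):
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def stop(self):
        self._stop.set()
        if self._thread is not None:
            self._thread.join()
```

Separating `tick()` from the loop lets one round be triggered on demand, for example right after a component has been reported as recovered.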

Specifically, in the method for implementing high availability of an agent based on a distributed monitoring system according to the embodiment of the present invention, step S2 includes: deploying a detection program on the central server, which judges whether a proxy server is working normally by means of an IP reachability check and a connectivity check of the TCP service port: an ICMP (Internet Control Message Protocol) operation checks whether the network path from the central server to the proxy server is clear; a socket operation checks whether port 10051 on the proxy server can be reached from the central server; if either of the two checks fails, the proxy server is confirmed to be in an abnormal state.
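The two checks in step S2 could look roughly like the following sketch. The function names are hypothetical. The ICMP check is delegated to the system `ping` command and assumes Linux-style flags (`-c` for count, `-W` for the timeout in seconds), since sending raw ICMP packets from Python normally needs elevated privileges. The probes are passed as parameters so that they can be replaced.

```python
import socket
import subprocess


def icmp_reachable(host, timeout_s=2):
    """IP reachability check via the system ping command (Linux flags assumed)."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", "-W", str(timeout_s), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s + 1,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def tcp_port_open(host, port=10051, timeout_s=2.0):
    """Connectivity check of the TCP service port with a plain socket."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def proxy_is_abnormal(host, port=10051, icmp_probe=icmp_reachable, tcp_probe=tcp_port_open):
    """The proxy is abnormal if either of the two checks fails."""
    if not icmp_probe(host):
        return True
    return not tcp_probe(host, port)
```

Running the ICMP check first avoids waiting for a TCP timeout when the host is not reachable at all.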

Specifically, in the method for implementing high availability of the agent based on the distributed monitoring system in the embodiment of the present invention, step S3 includes:

S21: the central server stores the environment data of the abnormal proxy server at the time of switching, wherein the environment data comprises the operating configuration parameters required to start the current service, low-frequency change data generated during operation, and memory snapshots of key service data structures;

S22: the central server synchronizes the environment data to the standby server and outputs a switching log;

S23: the change-over switch in the central server is turned on, the standby proxy server is switched to main server so that it takes over all service processing of the abnormal proxy server, and the service data generated by the standby server is marked.
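The switching sequence S21 to S23 can be sketched as follows. All class, method and field names are hypothetical, the environment data is represented as a plain dictionary, and the "change-over switch" is modelled simply as a field recording which proxy server the standby is covering.

```python
import copy
import logging

log = logging.getLogger("ha.switch")


class StandbyProxy:
    """Hypothetical standby node that marks every record it produces."""

    def __init__(self, name):
        self.name = name
        self.active = False
        self.env = None
        self.records = []

    def load_environment(self, env):
        self.env = copy.deepcopy(env)

    def collect(self, timestamp, value):
        if not self.active:
            raise RuntimeError("standby is not acting as main server")
        # S23: mark the data so it can be identified when switching back.
        self.records.append(
            {"ts": timestamp, "value": value, "source": self.name, "marked": True}
        )


class CentralServer:
    """Hypothetical central server holding the change-over switch."""

    def __init__(self, standby):
        self.standby = standby
        self.saved_env = {}        # proxy name -> environment data at switch time
        self.switched_for = None   # proxy currently replaced by the standby

    def fail_over(self, proxy_name, env):
        # S21: store the environment data of the abnormal proxy server.
        self.saved_env[proxy_name] = copy.deepcopy(env)
        # S22: synchronize it to the standby and write a switching log.
        self.standby.load_environment(env)
        log.info("switching %s -> %s", proxy_name, self.standby.name)
        # S23: turn on the switch; the standby now acts as main server.
        self.switched_for = proxy_name
        self.standby.active = True
```

The deep copies keep the stored environment data unchanged even if the caller later modifies the original dictionary.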

Specifically, in the method for realizing high availability of the agent based on the distributed monitoring system in the embodiment of the present invention, step S4 includes:

S31: the central server confirms, through abnormal state monitoring, that the previously abnormal proxy server has returned to an available state;

S32: the environment data reported and stored when switching to the standby server and the marked data generated while the standby server was running are merged according to time;

S33: the central server stops the standby server and synchronizes the merged data to the recovered proxy server;

S34: the proxy server is started with the operating configuration parameters required to start the current service, taken from the environment data, and is switched to the main server state.
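The text does not spell out how the merge "according to time factors" works. One plausible reading, shown here purely as an assumption, is to keep the newest record per key and order the result by timestamp. The record layout (`key`, `ts`, `value`), the function names, and the callbacks standing in for the real stop, sync and start operations are all hypothetical.

```python
def merge_by_time(saved_records, marked_records):
    """Keep the newest record per key and return them ordered by time."""
    newest = {}
    for record in list(saved_records) + list(marked_records):
        current = newest.get(record["key"])
        # On equal timestamps the standby's marked record wins, because
        # the marked records are processed after the saved ones.
        if current is None or record["ts"] >= current["ts"]:
            newest[record["key"]] = record
    return sorted(newest.values(), key=lambda r: (r["ts"], r["key"]))


def fail_back(saved_env, marked_records, stop_standby, sync_to_proxy, start_proxy):
    """Run S32 to S34 once S31 has confirmed that the proxy is available."""
    merged = merge_by_time(saved_env["records"], marked_records)  # S32
    stop_standby()                                                # S33
    sync_to_proxy(merged)
    start_proxy(saved_env["run_config"])                          # S34
    return merged
```

Passing the three operations in as callbacks keeps the ordering of S33 and S34 visible without committing to any particular transport or process manager.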

In summary, the method for realizing high availability of the agent based on the distributed monitoring system provided by the invention provides a standby server and switches between the main server and the standby server through state checks and context data synchronization, thereby achieving high availability of the distributed monitoring system; the system continues to operate stably and reliably when a single point of failure occurs in the distributed monitoring system; the automatic high-availability scheme needs no manual intervention, which greatly reduces the operation and maintenance workload and improves operation and maintenance efficiency; and the implementation is cost-effective, since high availability of the whole system can be achieved with a single standby server or a small number of them.

Although the present invention has been described with respect to the preferred embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A method for realizing high availability of an agent based on a distributed monitoring system is characterized by comprising the following steps:

S3: when the central server detects that any data server or proxy server component is abnormal, the data of the abnormal component is stored, the standby server is started, and the standby server replaces the abnormal component as the main server;

S4: when the abnormal component returns to an available state, the standby server submits its synchronized data to the recovered component, so that the component self-heals, and the standby server returns to the standby state;

the step S3 specifically includes:

S21: the central server stores the environment data of the abnormal proxy server at the time of switching;

S23: the change-over switch in the central server is turned on, the standby proxy server is switched to main server so that it takes over all service processing of the abnormal proxy server, and the service data generated by the standby server is marked;

the environment data comprises the operating configuration parameters required to start the current service, low-frequency change data generated during operation, and memory snapshots of key service data structures;

the step S4 specifically includes:

S32: the environment data reported and stored when switching to the standby server and the marked data generated while the standby server was running are merged according to time;

2. The method for realizing high availability of the agent based on the distributed monitoring system as claimed in claim 1, wherein the central server is connected to the data server and the proxy server for real-time state detection; the standby server is connected to the data server and the proxy server.

3. The method for realizing high availability of agents based on the distributed monitoring system as claimed in claim 1, wherein the central server periodically detects the status of each component of the distributed monitoring system through a system timer.

4. The method for realizing high availability of the agent based on the distributed monitoring system according to claim 1, wherein the step S2 specifically comprises: deploying a detection program on the central server, which judges whether the proxy server is working normally by means of an IP reachability check and a connectivity check of the TCP service port: an ICMP operation checks whether the network path from the central server to the proxy server is clear; a socket operation checks whether port 10051 on the proxy server can be reached from the central server; if either of the two checks fails, the proxy server is confirmed to be in an abnormal state.