CN110855494A - Method for realizing high availability of agent based on distributed monitoring system - Google Patents
- Publication number
- CN110855494A (application CN201911129032.0A)
- Authority
- CN
- China
- Prior art keywords
- server
- data
- monitoring system
- standby
- abnormal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0654—Management of faults, events, alarms or notifications using network fault recovery
- H04L41/0663—Performing the actions predefined by failover planning, e.g. switching to standby network elements
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/08—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
- H04L43/0805—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability
- H04L43/0811—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability by checking connectivity
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/08—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
- H04L43/0805—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability
- H04L43/0817—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability by checking functioning
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
- H04L67/1095—Replication or mirroring of data, e.g. scheduling or transport for data synchronisation between network nodes
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/50—Network services
- H04L67/56—Provisioning of proxy services
Abstract
The invention discloses a method for realizing high availability of an agent (proxy server) based on a distributed monitoring system, which comprises the following steps: S1: deploying a standby server in the distributed monitoring system; S2: the central server periodically checks each component of the distributed monitoring system for abnormal states; S3: when a component serving as a main server is detected to be abnormal, the standby server is started and takes over from the abnormal component as the main server; S4: when the abnormal component returns to an available state, the standby server submits the synchronized data to the recovered component, so that the component heals itself. By providing a standby server, the invention switches between the main server and the standby server through state checks and context data synchronization; the distributed monitoring system continues to operate stably and reliably when a single point of failure occurs; no manual intervention is needed, which improves operation and maintenance efficiency; and high availability of the entire system is achieved with a single standby server or a small number of them.
Description
Technical Field
The invention relates to a high availability method of a monitoring system, in particular to a method for realizing high availability of an agent based on a distributed monitoring system.
Background
Existing distributed monitoring systems are generally deployed to balance load and to work around network isolation. In a traditional monitoring system, the central server performs a series of related operations such as high-concurrency data acquisition, data processing and data access; as the monitoring scale grows, the central server reaches a performance bottleneck, and expanding its capacity becomes less and less cost-effective.
By deploying a distributed monitoring system, the logical stages of data acquisition, data processing and data access are separated: the central server is responsible for data processing, the data server for data access, and the proxy server for data acquisition. This structure relieves monitoring pressure cost-effectively and makes capacity expansion much more convenient: whenever a performance bottleneck appears, the responsible component can be located and expanded without affecting the other components.
Another advantage of the distributed monitoring system is that it addresses network isolation. Modern network planning typically isolates the service network from the management network for network security. If a monitoring system deployed in the management network needs to access an IT network element of the service network, dedicated policies and routes usually have to be opened for it. In a large network, the amount of work required to provision such ad hoc access is enormous; it becomes increasingly difficult to maintain as the network changes, and the large number of policy and routing entries can also degrade network device performance. By deploying an agent monitoring server in the service network, setting it as a trusted host, and giving it an exit connected to the management network, the configuration becomes much simpler and easier to maintain.
The prior art has the following problem: the components of the distributed monitoring system do not implement high availability, so the failure of a single machine seriously affects monitoring coverage.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a method for realizing high availability of the agent based on a distributed monitoring system, in which high availability is achieved by providing a standby server.
To solve this technical problem, the invention provides a method for realizing high availability of the agent based on a distributed monitoring system, which comprises the following steps: S1: deploying a standby server in a distributed monitoring system, wherein the distributed monitoring system comprises a central server, a data server and a plurality of proxy servers; the central server initially selects the data server and the plurality of proxy servers as main servers; S2: the central server periodically monitors the data server and the plurality of proxy server components for abnormal states; S3: when any data server or proxy server component is detected to be abnormal, the data of the abnormal component is stored, the standby server is started, and the standby server takes over from the abnormal component as the main server; S4: when the abnormal component returns to an available state, the standby server submits the synchronized data to the recovered component, so that the component heals itself, and the standby server returns to the standby state.
Further, the central server is connected to the data server and the proxy server to perform state detection in real time; the standby server is connected to the data server and the proxy server.
Further, the central server periodically detects the states of the components of the distributed monitoring system through a system timer.
Further, step S2 specifically includes: deploying a detection program on the central server and judging whether the proxy server is working normally through an IP reachability check and a connectivity check of the TCP service port: an ICMP operation checks whether the network path from the central server to the proxy server is clear; a socket operation checks whether port 10051 on the proxy server can be reached from the central server; if either of the two checks fails, the proxy server is confirmed to be in an abnormal state.
Further, step S3 specifically includes: S21: the central server saves the environment data of the abnormal proxy server at the time of switching; S22: the central server synchronizes the environment data to the standby server and outputs a switching log; S23: a changeover switch in the central server is turned on, the standby proxy server is switched to act as the main server and takes over all service processing of the original abnormal proxy server, and the service data generated by the standby server is marked.
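The two checks of step S2 can be sketched as follows. This is a minimal illustration, not the patented detection program: the function names are hypothetical, the ICMP check shells out to the system `ping` with Linux-style flags (an assumption), and only port 10051 and the "either check fails" rule come from the text.

```python
import socket
import subprocess
from typing import Callable

SERVICE_PORT = 10051  # TCP service port named in the text


def icmp_reachable(host: str, timeout_s: int = 1) -> bool:
    """Send one ICMP echo via the system `ping` (Linux-style flags; an assumption)."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", "-W", str(timeout_s), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s + 2,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def tcp_port_open(host: str, port: int = SERVICE_PORT, timeout_s: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def proxy_is_abnormal(
    host: str,
    ping: Callable[[str], bool] = icmp_reachable,
    tcp: Callable[[str], bool] = tcp_port_open,
) -> bool:
    """The proxy is abnormal if either the ICMP check or the TCP check fails."""
    return not (ping(host) and tcp(host))
```

The two probes are passed in as parameters so that the decision rule can be exercised without network access, for example `proxy_is_abnormal("10.0.0.5", ping=lambda h: True, tcp=lambda h: False)` evaluates to `True`.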
Further, the environment data includes operation configuration parameters required by current service start, low-frequency change data generated during operation, and memory snapshots of key service data structures.
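As an illustration only, the three kinds of environment data listed above could be carried in a single serializable record. The class name, field names, the `captured_at` timestamp and the JSON encoding are assumptions made for this sketch; the text does not specify a storage format.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Any, Dict


@dataclass
class EnvironmentData:
    """Context captured at switchover (all names here are illustrative)."""

    proxy_id: str
    config_params: Dict[str, Any]       # run configuration needed to start the service
    low_frequency_data: Dict[str, Any]  # data that changes rarely during operation
    memory_snapshot: Dict[str, Any]     # snapshot of key service data structures
    captured_at: float = field(default_factory=time.time)

    def to_json(self) -> str:
        """Serialize for storage on the central server or transfer to the standby."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, payload: str) -> "EnvironmentData":
        """Rebuild the record from its serialized form."""
        return cls(**json.loads(payload))
```

A record serialized with `to_json()` and restored with `from_json()` compares equal to the original, which is the property needed when the same context is later synchronized to the standby server and back to the recovered proxy server.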
Further, step S4 specifically includes: S31: the central server confirms, through abnormal state monitoring, that the original abnormal proxy server has returned to an available state; S32: the environment data reported and saved at the time of switching to the standby server is merged, according to time, with the marked data generated while the standby server was running; S33: the central server stops the standby server and synchronizes the merged data to the original abnormal proxy server that has returned to the available state; S34: the proxy server is started according to the run configuration parameters required for starting the current service in the environment data, and is switched to the main server state.
Compared with the prior art, the invention has the following beneficial effects: the method for realizing high availability of the agent based on the distributed monitoring system provides a standby server and switches between the main server and the standby server through state checks and context data synchronization, thereby making the distributed monitoring system highly available; the system continues to operate stably and reliably when a single point of failure occurs in the distributed monitoring system; the automated high-availability scheme needs no manual intervention, greatly reduces the operation and maintenance workload, and improves operation and maintenance efficiency; and the implementation is cost-effective, since high availability of the whole system can be achieved with a single standby server or a small number of them.
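The text does not define "merging according to time factors" in detail. One plausible reading, sketched below under that assumption, is that for each data key the record with the newest timestamp is kept, and that on a tie the record marked as produced by the standby server wins. The `Record` type and `merge_by_time` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, List


@dataclass(frozen=True)
class Record:
    key: str
    value: Any
    timestamp: float
    from_standby: bool = False  # the "mark" placed on data produced by the standby


def merge_by_time(saved: Iterable[Record], marked: Iterable[Record]) -> List[Record]:
    """Merge saved environment records with standby-marked records, newest wins."""
    latest: Dict[str, Record] = {}
    # Saved records are processed first, so on equal timestamps the standby
    # record (processed later) replaces the saved one.
    for rec in list(saved) + list(marked):
        current = latest.get(rec.key)
        if current is None or rec.timestamp >= current.timestamp:
            latest[rec.key] = rec
    # Return the result in chronological order for replay on the recovered proxy.
    return sorted(latest.values(), key=lambda r: (r.timestamp, r.key))
```

Other merge policies (for example, keeping every record and only ordering them by time) would fit the wording equally well; the choice depends on whether the data is state-like or log-like.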
Drawings
FIG. 1 is a flowchart of a method for implementing high availability of agents based on a distributed monitoring system according to an embodiment of the present invention;
FIG. 2 is an architecture diagram of a distributed monitoring system according to an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the figures and examples.
Fig. 1 is a flowchart of a method for implementing high availability of an agent based on a distributed monitoring system in an embodiment of the present invention.
Referring to fig. 1, a method for implementing high availability of an agent based on a distributed monitoring system according to an embodiment of the present invention includes the following steps:
S1: deploying a standby server in a distributed monitoring system, wherein the distributed monitoring system comprises a central server, a data server and a plurality of proxy servers; the central server initially selects the data server and the plurality of proxy servers as main servers;
S2: the central server periodically monitors the data server and the plurality of proxy server components for abnormal states;
S3: when any data server or proxy server component is detected to be abnormal, the data of the abnormal component is stored, the standby server is started, and the standby server takes over from the abnormal component as the main server;
S4: when the abnormal component returns to an available state, the standby server submits the synchronized data to the recovered component, so that the component heals itself, and the standby server returns to the standby state.
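Steps S3 and S4 amount to two role transitions for a monitored component and the standby server. The sketch below expresses them as a small transition function; the `Role` and `Signal` names are hypothetical and are introduced only for illustration.

```python
from enum import Enum
from typing import Tuple


class Role(Enum):
    MAIN = "main"
    STANDBY = "standby"
    ABNORMAL = "abnormal"


class Signal(Enum):
    CHECK_FAILED = "check_failed"  # S2 reported an abnormal state
    RECOVERED = "recovered"        # S2 reported the component available again


def next_roles(component: Role, standby: Role, signal: Signal) -> Tuple[Role, Role]:
    """Return the (component, standby) roles after a monitoring signal."""
    # S3: the standby takes over from a failed main component.
    if signal is Signal.CHECK_FAILED and component is Role.MAIN and standby is Role.STANDBY:
        return Role.ABNORMAL, Role.MAIN
    # S4: the recovered component becomes main again and the standby steps back.
    if signal is Signal.RECOVERED and component is Role.ABNORMAL and standby is Role.MAIN:
        return Role.MAIN, Role.STANDBY
    # Any other combination leaves the roles unchanged.
    return component, standby
```

In this sketch a second failure while the standby is already acting as main leaves the roles unchanged; how the method behaves with several simultaneous failures is not stated in the text beyond the remark that a small number of standby servers may be used.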
Referring to FIG. 2, the distributed monitoring system in the embodiment of the present invention includes a central server, together with a data server and a plurality of proxy servers serving as main servers; the central server is connected to the data server and the proxy servers to perform state detection in real time; the standby server is connected to the data server and the proxy servers. The central server periodically checks the state of every component of the distributed monitoring system by means of a system timer.
Specifically, in the method for realizing high availability of the agent based on the distributed monitoring system in the embodiment of the present invention, step S2 specifically includes: deploying a detection program on the central server and judging whether the proxy server is working normally through an IP reachability check and a connectivity check of the TCP service port: an ICMP (Internet Control Message Protocol) operation checks whether the network path from the central server to the proxy server is clear; a socket operation checks whether port 10051 on the proxy server can be reached from the central server; if either of the two checks fails, the proxy server is confirmed to be in an abnormal state.
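The periodic state check driven by a timer can be illustrated with a background thread that runs a check function at a fixed interval. This is a sketch under assumptions: the text speaks only of a "system timer", and the `PeriodicCheck` class, its interval parameter and the thread-based implementation are stand-ins chosen for brevity.

```python
import threading
from typing import Callable


class PeriodicCheck:
    """Run `check` every `interval_s` seconds until stop() is called."""

    def __init__(self, interval_s: float, check: Callable[[], None]) -> None:
        self._interval_s = interval_s
        self._check = check
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self) -> None:
        # Event.wait returns False when the interval elapses and True once
        # stop() has been called, which ends the loop.
        while not self._stop.wait(self._interval_s):
            self._check()

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()
```

The check function would typically iterate over the data server and proxy servers and record which of them failed their state check; waiting on an event rather than sleeping lets `stop()` take effect without waiting for a full interval.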
Specifically, in the method for realizing high availability of the agent based on the distributed monitoring system in the embodiment of the present invention, step S3 specifically includes:
S21: the central server saves the environment data of the abnormal proxy server at the time of switching, wherein the environment data comprises the run configuration parameters required for starting the current service, low-frequency change data generated during operation, and memory snapshots of key service data structures;
S22: the central server synchronizes the environment data to the standby server and outputs a switching log;
S23: a changeover switch in the central server is turned on, the standby proxy server is switched to act as the main server and takes over all service processing of the original abnormal proxy server, and the service data generated by the standby server is marked.
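Steps S21 to S23 can be sketched as a single failover routine on the central server. All class, field and method names below are hypothetical, the environment data is represented as a plain dictionary, and the guard that rejects a second failover while the single standby is busy is an assumption of this sketch rather than something the text prescribes.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Standby:
    env: Optional[Dict[str, Any]] = None
    active_for: Optional[str] = None  # id of the proxy being replaced, if any
    produced: List[Dict[str, Any]] = field(default_factory=list)

    def handle(self, item: Any, timestamp: float) -> None:
        # S23: data produced while standing in is marked for later merging.
        self.produced.append(
            {"value": item, "timestamp": timestamp, "from_standby": True}
        )


@dataclass
class CentralServer:
    standby: Standby
    saved_env: Dict[str, Dict[str, Any]] = field(default_factory=dict)
    switch_log: List[str] = field(default_factory=list)
    routing: Dict[str, str] = field(default_factory=dict)  # proxy id -> serving node

    def fail_over(self, proxy_id: str, env: Dict[str, Any]) -> None:
        if self.standby.active_for is not None:
            raise RuntimeError("standby already in use")  # assumption: one standby
        self.saved_env[proxy_id] = dict(env)      # S21: save environment data
        self.standby.env = dict(env)              # S22: synchronize to the standby
        self.switch_log.append(f"switch {proxy_id} -> standby")  # S22: switching log
        self.standby.active_for = proxy_id        # S23: standby acts as main server
        self.routing[proxy_id] = "standby"
```

Keeping a copy of the environment data on the central server as well as on the standby is what later allows the saved data and the marked standby data to be merged when the original proxy server recovers.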
Specifically, in the method for realizing high availability of the agent based on the distributed monitoring system in the embodiment of the present invention, step S4 specifically includes:
S31: the central server confirms, through abnormal state monitoring, that the original abnormal proxy server has returned to an available state;
S32: the environment data reported and saved at the time of switching to the standby server is merged, according to time, with the marked data generated while the standby server was running;
S33: the central server stops the standby server and synchronizes the merged data to the original abnormal proxy server that has returned to the available state;
S34: the proxy server is started according to the run configuration parameters required for starting the current service in the environment data, and is switched to the main server state.
In summary, the method for realizing high availability of the agent based on the distributed monitoring system provided by the invention sets up a standby server and makes the distributed monitoring system highly available through state checks and context data synchronization; the system continues to operate stably and reliably when a single point of failure occurs in the distributed monitoring system; the automated high-availability scheme needs no manual intervention, greatly reduces the operation and maintenance workload, and improves operation and maintenance efficiency; and the implementation is cost-effective, since high availability of the whole system can be achieved with a single standby server or a small number of them.
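The order of operations in S31 to S34 can be sketched as one recovery routine. The function and parameter names are hypothetical; the external actions (availability check, stopping the standby, pushing data, starting the proxy) are passed in as callables, and the merge shown here is a simple chronological ordering chosen for illustration.

```python
from typing import Any, Callable, Dict, List


def recover_proxy(
    proxy_id: str,
    is_available: Callable[[str], bool],
    saved_env: Dict[str, Any],
    marked_records: List[Dict[str, Any]],
    stop_standby: Callable[[], None],
    push_to_proxy: Callable[[str, Dict[str, Any]], None],
    start_proxy: Callable[[str, Dict[str, Any]], None],
) -> bool:
    """Run S31 to S34; return False if the proxy is not yet available."""
    # S31: confirm recovery through the normal state check.
    if not is_available(proxy_id):
        return False
    # S32: merge saved records with standby-marked records in time order.
    merged = {
        "config_params": saved_env.get("config_params", {}),
        "records": sorted(
            list(saved_env.get("records", [])) + list(marked_records),
            key=lambda r: r["timestamp"],
        ),
    }
    # S33: stop the standby, then hand the merged data to the recovered proxy.
    stop_standby()
    push_to_proxy(proxy_id, merged)
    # S34: start the proxy with its saved run configuration; it becomes main again.
    start_proxy(proxy_id, merged["config_params"])
    return True
```

Stopping the standby before synchronizing, as S33 describes, means no new marked data can appear after the merge, at the cost of a short gap in service until S34 completes.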
Although the present invention has been described with respect to the preferred embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (7)
1. A method for realizing high availability of an agent based on a distributed monitoring system is characterized by comprising the following steps:
S1: deploying a standby server in a distributed monitoring system, wherein the distributed monitoring system comprises a central server, a data server and a plurality of proxy servers; the central server initially selects the data server and the plurality of proxy servers as main servers;
S2: the central server periodically monitors the data server and the plurality of proxy server components for abnormal states;
S3: when the central server detects that any data server or proxy server component is abnormal, the data of the abnormal component is stored, the standby server is started, and the standby server takes over from the abnormal component as the main server;
S4: when the abnormal component returns to an available state, the standby server submits the synchronized data to the recovered component, so that the component heals itself, and the standby server returns to the standby state.
2. The method for realizing high availability of the agent based on the distributed monitoring system as claimed in claim 1, wherein the central server is connected to the data server and the proxy server for real-time state detection; the standby server is connected to the data server and the proxy server.
3. The method for realizing high availability of agents based on the distributed monitoring system as claimed in claim 1, wherein the central server periodically detects the status of each component of the distributed monitoring system through a system timer.
4. The method for realizing high availability of agents based on the distributed monitoring system according to claim 1, wherein step S2 specifically includes: deploying a detection program on the central server and judging whether the proxy server is working normally through an IP reachability check and a connectivity check of the TCP service port: an ICMP operation checks whether the network path from the central server to the proxy server is clear; a socket operation checks whether port 10051 on the proxy server can be reached from the central server; if either of the two checks fails, the proxy server is confirmed to be in an abnormal state.
5. The method for realizing high availability of agents based on the distributed monitoring system according to claim 1, wherein the step S3 specifically includes:
S21: the central server saves the environment data of the abnormal proxy server at the time of switching;
S22: the central server synchronizes the environment data to the standby server and outputs a switching log;
S23: a changeover switch in the central server is turned on, the standby proxy server is switched to act as the main server and takes over all service processing of the original abnormal proxy server, and the service data generated by the standby server is marked.
6. The method for achieving high agent availability based on the distributed monitoring system of claim 5, wherein the environment data comprises running configuration parameters required by current service startup, low-frequency change data generated in running and memory snapshots of key business data structures.
7. The method for realizing high availability of agents based on the distributed monitoring system according to claim 6, wherein the step S4 specifically includes:
S31: the central server confirms, through abnormal state monitoring, that the original abnormal proxy server has returned to an available state;
S32: the environment data reported and saved at the time of switching to the standby server is merged, according to time, with the marked data generated while the standby server was running;
S33: the central server stops the standby server and synchronizes the merged data to the original abnormal proxy server that has returned to the available state;
S34: the proxy server is started according to the run configuration parameters required for starting the current service in the environment data, and is switched to the main server state.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911129032.0A CN110855494B (en) | 2019-11-18 | 2019-11-18 | Method for realizing high availability of agent based on distributed monitoring system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911129032.0A CN110855494B (en) | 2019-11-18 | 2019-11-18 | Method for realizing high availability of agent based on distributed monitoring system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110855494A (en) | 2020-02-28 |
CN110855494B (en) | 2022-10-04 |
Family
ID=69601993
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911129032.0A Active CN110855494B (en) | 2019-11-18 | 2019-11-18 | Method for realizing high availability of agent based on distributed monitoring system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110855494B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101996111A (en) * | 2010-11-30 | 2011-03-30 | 华为技术有限公司 | Switching method, device and distributed blade server system |
CN102231681A (en) * | 2011-06-27 | 2011-11-02 | 中国建设银行股份有限公司 | High availability cluster computer system and fault treatment method thereof |
CN102662751A (en) * | 2012-03-30 | 2012-09-12 | 浪潮电子信息产业股份有限公司 | Method for improving availability of virtual machine system based on thermomigration |
CN105554106A (en) * | 2015-12-15 | 2016-05-04 | 上海仪电(集团)有限公司 | Memcache distributed caching system |
CN105677516A (en) * | 2016-01-07 | 2016-06-15 | 成都市思叠科技有限公司 | Method for efficient and reliable backup recovery in calculation approach storage cloud platform |
CN106945691A (en) * | 2017-04-10 | 2017-07-14 | 湖南中车时代通信信号有限公司 | The real-time hot standby switch device of server multicenter of automatic train monitor |
CN111131146A (en) * | 2019-11-08 | 2020-05-08 | 北京航空航天大学 | Multi-supercomputing center software system deployment and incremental updating method in wide area environment |
CN111949444A (en) * | 2020-06-24 | 2020-11-17 | 武汉烽火众智数字技术有限责任公司 | Data backup and recovery system and method based on distributed service cluster |
Also Published As
Publication number | Publication date |
---|---|
CN110855494B (en) | 2022-10-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110224871B (en) | High-availability method and device for Redis cluster | |
Obadia et al. | Failover mechanisms for distributed SDN controllers | |
CN101588304B (en) | Implementation method of VRRP and device | |
CN100387017C (en) | High usable self-healing Logic box fault detecting and tolerating method for constituting multi-machine system | |
US20190081853A1 (en) | Link Handover Method for Service in Storage System, and Storage Device | |
KR20070026327A (en) | Redundant routing capabilities for a network node cluster | |
USRE45454E1 (en) | Dual-homing layer 2 switch | |
CN103780407A (en) | Gateway dynamic switching method and apparatus in distributed resilient network interconnection (DRNI) | |
CN105515812A (en) | Fault processing method of resources and device | |
CN104378232A (en) | Schizencephaly finding and recovering method and device under main joint and auxiliary joint cluster networking mode | |
US20150113313A1 (en) | Method of operating a server system with high availability | |
CN103490914A (en) | Switching system and switching method for multi-machine hot standby of network application equipment | |
CN105429799A (en) | Server backup method and device | |
CN110971462A (en) | Equipment switching method, device, equipment and storage medium | |
CN113489149B (en) | Power grid monitoring system service master node selection method based on real-time state sensing | |
CN103220189A (en) | Multi-active detection (MAD) backup method and equipment | |
CN110855494B (en) | Method for realizing high availability of agent based on distributed monitoring system | |
CN104780067A (en) | Method and device for rebooting PE (port extender) | |
CN110677288A (en) | Edge computing system and method generally used for multi-scene deployment | |
CN102546313B (en) | Multi-activation detection method and multi-activation detection device | |
CN109101372A (en) | Redundancy switching method, storage medium and the Shelf Management Module of Shelf Management Module | |
CA3214690A1 (en) | Passive optical network for utility infrastructure resiliency | |
CN112131201B (en) | Method, system, equipment and medium for high availability of network additional storage | |
CN106130783B (en) | Port fault processing method and device | |
CN105871524B (en) | A kind of method and system based on TIPC protocol realization two-node cluster hot backup |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||