CN110855494B - Method for realizing high availability of agent based on distributed monitoring system - Google Patents
- Publication number
- CN110855494B (application number CN201911129032.0A)
- Authority
- CN
- China
- Prior art keywords
- server
- data
- standby
- abnormal
- monitoring system
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0654—Management of faults, events, alarms or notifications using network fault recovery
- H04L41/0663—Performing the actions predefined by failover planning, e.g. switching to standby network elements
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/08—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
- H04L43/0805—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability
- H04L43/0811—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability by checking connectivity
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/08—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
- H04L43/0805—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability
- H04L43/0817—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability by checking functioning
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
- H04L67/1095—Replication or mirroring of data, e.g. scheduling or transport for data synchronisation between network nodes
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/50—Network services
- H04L67/56—Provisioning of proxy services
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Environmental & Geological Engineering (AREA)
- Hardware Redundancy (AREA)
Abstract
The invention discloses a method for realizing high availability of an agent based on a distributed monitoring system, comprising the following steps. S1: a standby server is deployed in the distributed monitoring system. S2: the central server periodically monitors all components of the distributed monitoring system for abnormal states. S3: when a component serving as a main server is detected to be abnormal, the standby server is started and replaces the abnormal component, working as the main server. S4: when the abnormal component returns to an available state, the standby server submits its synchronized data to the recovered component, so that the component self-heals. By providing a standby server, the invention switches between the main server and the standby server through state checks and context data synchronization; the distributed monitoring system continues to operate stably and reliably when a single point of failure occurs; no manual intervention is needed, which improves operation and maintenance efficiency; and high availability of the entire system is achieved with a single standby server or a small number of them.
Description
Technical Field
The invention relates to a high availability method of a monitoring system, in particular to a method for realizing high availability of an agent based on a distributed monitoring system.
Background
Existing distributed monitoring systems are generally deployed to achieve load balancing and to solve the problem of network isolation. In a traditional monitoring system, a single central server performs high-concurrency data acquisition, data processing, data access, and the related logic. As the monitoring scale grows, the central server reaches a performance bottleneck, and expanding its capacity becomes less and less cost-effective.
A distributed monitoring system separates these logical stages: the central server is responsible for data processing, the data server for data access, and the proxy servers for data acquisition. This structure relieves monitoring pressure cost-effectively and makes capacity expansion much more convenient: whenever a performance bottleneck appears, the responsible component can be identified and expanded without affecting the other components.
Another advantage of a distributed monitoring system is that it addresses network isolation. For security reasons, modern network planning typically isolates the service network from the management network. If a monitoring system deployed in the management network needs to access IT network elements in the service network, dedicated policies and routes usually have to be opened. In a large network, the work required to provision such ad hoc access is enormous, it becomes increasingly difficult to maintain as the network changes, and the large number of policy and routing entries can degrade network device performance. Deploying a proxy monitoring server in the service network, setting it as a trusted host, and giving it an exit to the management network makes the configuration much simpler and easier to maintain.
The existing technical scheme has the following problem: the components of the distributed monitoring system do not implement high availability, so a single-machine failure of any component seriously affects monitoring coverage.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a method for realizing high availability of the agent based on a distributed monitoring system, and the high availability is realized by setting a standby server.
The technical scheme adopted by the invention to solve this technical problem is a method for realizing high availability of the agent based on a distributed monitoring system, comprising the following steps:
S1: deploying a standby server in a distributed monitoring system, wherein the distributed monitoring system comprises a central server, a data server and a plurality of proxy servers; the central server initially designates the data server and the plurality of proxy servers as main servers;
S2: the central server periodically monitors the data server and the plurality of proxy servers for abnormal states;
S3: when the data server or any proxy server is detected to be abnormal, the data of the abnormal component is saved, the standby server is started, and the standby server replaces the abnormal component and works as the main server;
S4: when the abnormal component returns to an available state, the standby server submits its synchronized data to the recovered component, so that the component self-heals, and the standby server returns to the standby state.
Further, the central server is connected to the data server and the proxy server to perform state detection in real time; the standby server is connected to the data server and the proxy server.
Further, the central server periodically detects the states of the components of the distributed monitoring system through a system timer.
Further, step S2 specifically includes: deploying a detection program on the central server, which judges whether a proxy server is working normally through IP reachability detection and connectivity detection of the TCP service interface: an ICMP operation checks whether the network path from the central server to the proxy server is clear, and a socket operation checks whether port 10051 on the proxy server can be reached from the central server; when either of the two checks fails, the proxy server is confirmed to be in an abnormal state.
Further, step S3 specifically includes: S21: the central server saves the environment data of the abnormal proxy server at the time of switching; S22: the central server synchronizes the environment data to the standby server and outputs a switching log; S23: the change-over switch in the central server is turned on, the standby server is switched to act as the main server and takes over all service processing of the original abnormal proxy server, and the service data generated by the standby server is marked.
Further, the environment data includes operation configuration parameters required by current service start, low-frequency change data generated during operation, and memory snapshots of key service data structures.
Further, step S4 specifically includes: S31: the central server confirms, through abnormal-state monitoring, that the original abnormal proxy server has returned to an available state; S32: the environment data reported and saved at the time of switching and the marked data generated while the standby server was operating are merged according to time factors; S33: the central server stops the standby server and synchronizes the merged data to the recovered proxy server; S34: the proxy server is started according to the operation configuration parameters required for service start in the environment data, and is switched to the main-server state.
Compared with the prior art, the invention has the following beneficial effects: by providing a standby server and switching between the main server and the standby server through state checks and context data synchronization, the method makes the distributed monitoring system highly available; the system continues to operate stably and reliably when a single point of failure occurs; the automatic high-availability scheme needs no manual intervention, which greatly reduces the operation and maintenance workload and improves operation and maintenance efficiency; and the implementation is cost-effective, since high availability of the whole system is achieved with a single standby server or a small number of them.
Drawings
FIG. 1 is a flowchart of a method for implementing high availability of agents based on a distributed monitoring system according to an embodiment of the present invention;
FIG. 2 is an architecture diagram of a distributed monitoring system according to an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the figures and examples.
Fig. 1 is a flowchart of a method for implementing high availability of an agent based on a distributed monitoring system in an embodiment of the present invention.
Referring to fig. 1, a method for implementing high availability of an agent based on a distributed monitoring system according to an embodiment of the present invention includes the following steps:
S1: deploying a standby server in a distributed monitoring system, wherein the distributed monitoring system comprises a central server, a data server and a plurality of proxy servers; the central server initially designates the data server and the plurality of proxy servers as main servers;
S2: the central server periodically monitors the data server and the plurality of proxy servers for abnormal states;
S3: when the data server or any proxy server is detected to be abnormal, the data of the abnormal component is saved, the standby server is started, and the standby server replaces the abnormal component and works as the main server;
S4: when the abnormal component returns to an available state, the standby server submits its synchronized data to the recovered component, so that the component self-heals, and the standby server returns to the standby state.
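Steps S1 to S4 can be read as a small role-switching state machine. The Python sketch below is only an illustration under stated assumptions, not the patented implementation: the names `Role`, `Component`, `CentralServer`, and `on_check` are hypothetical, the health result is passed in rather than measured, and a single standby server is assumed to cover one abnormal component at a time.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    MAIN = "main"
    STANDBY = "standby"
    ABNORMAL = "abnormal"


@dataclass
class Component:
    name: str
    role: Role
    data: dict = field(default_factory=dict)


class CentralServer:
    """Hypothetical coordinator that walks through steps S1 to S4."""

    def __init__(self, components, standby):
        # S1: the deployed components start as main servers, plus one standby.
        self.components = {c.name: c for c in components}
        self.standby = standby
        self.replaced = None  # name of the component the standby is covering

    def on_check(self, name, healthy):
        """Handle one S2 check result for the named component."""
        comp = self.components[name]
        if not healthy and comp.role is Role.MAIN and self.replaced is None:
            # S3: save the abnormal component's data and promote the standby.
            self.standby.data = dict(comp.data)
            comp.role = Role.ABNORMAL
            self.standby.role = Role.MAIN
            self.replaced = name
        elif healthy and comp.role is Role.ABNORMAL and self.replaced == name:
            # S4: hand the synchronized data back; the standby steps down.
            comp.data = dict(self.standby.data)
            comp.role = Role.MAIN
            self.standby.role = Role.STANDBY
            self.standby.data = {}
            self.replaced = None
```

In this sketch a second failure that occurs while the standby is already in use is ignored; handling it would require additional standby servers, which the text allows ("a single or a small number of standby servers").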
Referring to FIG. 2, the distributed monitoring system used by the method according to an embodiment of the present invention includes a central server, together with a data server and a plurality of proxy servers serving as main servers. The central server is connected to the data server and the proxy servers to perform state detection in real time; the standby server is connected to the data server and the proxy servers. The central server periodically detects the states of all components of the distributed monitoring system through a system timer.
Specifically, in the method for implementing high availability of an agent based on a distributed monitoring system according to the embodiment of the present invention, step S2 includes: deploying a detection program on the central server, which judges whether a proxy server is working normally through IP reachability detection and connectivity detection of the TCP service interface: an ICMP (Internet Control Message Protocol) operation checks whether the network path from the central server to the proxy server is clear, and a socket operation checks whether port 10051 on the proxy server can be reached from the central server; when either of the two checks fails, the proxy server is confirmed to be in an abnormal state.
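The timer-driven detection described above could be sketched as follows. This is a minimal illustration that assumes a background thread stands in for the "system timer"; the class name `PeriodicChecker` and its parameters are hypothetical and do not come from the patent.

```python
import threading


class PeriodicChecker:
    """Runs check() every `interval` seconds on a background thread."""

    def __init__(self, interval, check):
        self.interval = interval
        self.check = check
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Event.wait returns False on timeout and True once stop() is called,
        # so the loop fires check() once per interval until it is stopped.
        while not self._stop.wait(self.interval):
            self.check()

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()
```

A caller would pass a function that probes every component, for example `PeriodicChecker(30, probe_all).start()`, where `probe_all` and the 30-second interval are placeholders; the text does not state a detection interval.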
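The two-part detection in step S2 might look like the sketch below. It is an assumption-laden illustration: the ICMP probe shells out to the system `ping` command with Linux iputils flags (`-c`, `-W`), because sending raw ICMP from Python normally needs elevated privileges, and the function names are hypothetical. Only port 10051 and the "either check fails" rule come from the text.

```python
import socket
import subprocess

PROXY_PORT = 10051  # service port named in the text


def icmp_reachable(host, timeout_s=1):
    """IP reachability via the system ping command (Linux flags assumed)."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", "-W", str(timeout_s), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s + 1,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def tcp_port_open(host, port=PROXY_PORT, timeout_s=1.0):
    """TCP connectivity: try to complete a connection to the service port."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def proxy_is_abnormal(host, icmp=icmp_reachable, tcp=tcp_port_open):
    """The proxy is abnormal if either of the two probes fails."""
    return not (icmp(host) and tcp(host))
```

The probes are passed as parameters so that the decision rule can be exercised without a network, for example `proxy_is_abnormal("10.0.0.5", icmp=lambda h: True, tcp=lambda h: False)` evaluates to `True`; the address is a placeholder.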
Specifically, in the method for implementing high availability of the agent based on the distributed monitoring system in the embodiment of the present invention, step S3 includes:
S21: the central server saves the environment data of the abnormal proxy server at the time of switching, wherein the environment data comprises the operation configuration parameters required to start the current service, low-frequency change data generated during operation, and memory snapshots of key service data structures;
S22: the central server synchronizes the environment data to the standby server and outputs a switching log;
S23: the change-over switch in the central server is turned on, the standby server is switched to act as the main server and takes over all service processing of the original abnormal proxy server, and the service data generated by the standby server is marked.
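The switchover sub-steps above can be illustrated with a short Python sketch. Everything here is hypothetical scaffolding: the field names of `EnvironmentData`, the `"origin": "standby"` marker, and the log format are assumptions chosen for illustration, since the text does not specify how the data is structured or marked.

```python
import copy
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EnvironmentData:
    """The three kinds of environment data named in S21."""
    config: dict         # operation configuration parameters for service start
    low_frequency: dict  # low-frequency change data generated during operation
    snapshot: dict       # memory snapshot of key service data structures
    saved_at: float = field(default_factory=time.time)


@dataclass
class StandbyServer:
    env: Optional[EnvironmentData] = None
    active: bool = False
    records: list = field(default_factory=list)

    def handle(self, payload, now=None):
        # S23: mark every record produced while standing in for the proxy.
        ts = time.time() if now is None else now
        record = {"payload": payload, "ts": ts, "origin": "standby"}
        self.records.append(record)
        return record


def switch_over(proxy_state, standby, log):
    """proxy_state: dict with keys config, low_frequency, snapshot."""
    env = EnvironmentData(**copy.deepcopy(proxy_state))  # S21: save a copy
    standby.env = env                                    # S22: synchronize
    log.append(f"switchover at {env.saved_at:.0f}")      # S22: switching log
    standby.active = True                                # S23: open the switch
    return env
```

The deep copy keeps the saved environment data independent of later changes on the failing proxy, and the marker on each record is what makes the later merge in step S4 possible.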
Specifically, in the method for realizing high availability of the agent based on the distributed monitoring system in the embodiment of the present invention, step S4 includes:
S31: the central server confirms, through abnormal-state monitoring, that the original abnormal proxy server has returned to an available state;
S32: the environment data reported and saved at the time of switching and the marked data generated while the standby server was operating are merged according to time factors;
S33: the central server stops the standby server and synchronizes the merged data to the recovered proxy server;
S34: the proxy server is started according to the operation configuration parameters required for service start in the environment data, and is switched to the main-server state.
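The text does not define how data is merged "according to time factors" in S32. One plausible reading, shown below purely as an assumption, is a last-writer-wins merge keyed by timestamp: each entry carries the time it was written, and the newer entry for a key replaces the older one. The function name `merge_by_time` and the `{key: (timestamp, value)}` layout are hypothetical.

```python
def merge_by_time(saved, marked):
    """Merge two {key: (timestamp, value)} maps; the newer timestamp wins.

    saved:  environment data stored at switchover (S21)
    marked: data generated and marked by the standby server (S23)
    """
    merged = dict(saved)  # copy so the stored environment data is not mutated
    for key, (ts, value) in marked.items():
        # Ties go to the standby's data, since it was written after switchover.
        if key not in merged or ts >= merged[key][0]:
            merged[key] = (ts, value)
    return merged
```

For example, merging `{"hosts": (100.0, ["a"])}` with `{"hosts": (160.0, ["a", "b"])}` keeps the standby's newer host list, which is the state the recovered proxy server would then start from in S33 and S34.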
In summary, the method for realizing high availability of the agent based on the distributed monitoring system provided by the invention provides a standby server and switches between the main server and the standby server through state checks and context data synchronization, thereby making the distributed monitoring system highly available; the system continues to operate stably and reliably when a single point of failure occurs; the automatic high-availability scheme needs no manual intervention, which greatly reduces the operation and maintenance workload and improves operation and maintenance efficiency; and the implementation is cost-effective, since high availability of the whole system is achieved with a single standby server or a small number of them.
Although the present invention has been described with respect to the preferred embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (4)
1. A method for realizing high availability of an agent based on a distributed monitoring system is characterized by comprising the following steps:
S1: deploying a standby server in a distributed monitoring system, wherein the distributed monitoring system comprises a central server, a data server and a plurality of proxy servers; the central server initially designates the data server and the plurality of proxy servers as main servers;
S2: the central server periodically monitors the data server and the plurality of proxy servers for abnormal states;
S3: when the central server detects that the data server or any proxy server is abnormal, the data of the abnormal component is saved, the standby server is started, and the standby server replaces the abnormal component and works as the main server;
S4: when the abnormal component returns to an available state, the standby server submits its synchronized data to the recovered component, so that the component self-heals, and the standby server returns to the standby state;
the step S3 specifically includes:
S21: the central server saves the environment data of the abnormal proxy server at the time of switching;
S22: the central server synchronizes the environment data to the standby server and outputs a switching log;
S23: the change-over switch in the central server is turned on, the standby server is switched to act as the main server and takes over all service processing of the original abnormal proxy server, and the service data generated by the standby server is marked;
the environment data comprises the operation configuration parameters required to start the current service, low-frequency change data generated during operation, and memory snapshots of key service data structures;
the step S4 specifically includes:
S31: the central server confirms, through abnormal-state monitoring, that the original abnormal proxy server has returned to an available state;
S32: the environment data reported and saved at the time of switching and the marked data generated while the standby server was operating are merged according to time factors;
S33: the central server stops the standby server and synchronizes the merged data to the recovered proxy server;
S34: the proxy server is started according to the operation configuration parameters required for service start in the environment data, and is switched to the main-server state.
2. The method for realizing high availability of the agent based on the distributed monitoring system as claimed in claim 1, wherein the central server is connected to the data server and the proxy servers for real-time state detection; the standby server is connected to the data server and the proxy servers.
3. The method for realizing high availability of agents based on the distributed monitoring system as claimed in claim 1, wherein the central server periodically detects the status of each component of the distributed monitoring system through a system timer.
4. The method for realizing high availability of the agent based on the distributed monitoring system according to claim 1, wherein the step S2 specifically comprises: deploying a detection program on the central server, which judges whether a proxy server is working normally through IP reachability detection and connectivity detection of the TCP service interface: an ICMP operation checks whether the network path from the central server to the proxy server is clear, and a socket operation checks whether port 10051 on the proxy server can be reached from the central server; when either of the two checks fails, the proxy server is confirmed to be in an abnormal state.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911129032.0A CN110855494B (en) | 2019-11-18 | 2019-11-18 | Method for realizing high availability of agent based on distributed monitoring system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110855494A (en) | 2020-02-28 |
CN110855494B (en) | 2022-10-04 |
Family
ID=69601993
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911129032.0A Active CN110855494B (en) | 2019-11-18 | 2019-11-18 | Method for realizing high availability of agent based on distributed monitoring system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110855494B (en) |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101996111A (en) * | 2010-11-30 | 2011-03-30 | 华为技术有限公司 | Switching method, device and distributed blade server system |
CN102231681B (en) * | 2011-06-27 | 2014-07-30 | 中国建设银行股份有限公司 | High availability cluster computer system and fault treatment method thereof |
CN102662751B (en) * | 2012-03-30 | 2016-05-11 | 浪潮电子信息产业股份有限公司 | A kind of method improving based on thermophoresis dummy machine system availability |
CN105554106A (en) * | 2015-12-15 | 2016-05-04 | 上海仪电(集团)有限公司 | Memcache distributed caching system |
CN105677516B (en) * | 2016-01-07 | 2019-11-05 | 成都市思叠科技有限公司 | A kind of back-up restoring method calculating the high efficient and reliable in storage cloud platform |
CN106945691B (en) * | 2017-04-10 | 2019-06-21 | 湖南中车时代通信信号有限公司 | The real-time hot standby switch device of the server multicenter of automatic train monitor |
CN111131146B (en) * | 2019-11-08 | 2021-04-09 | 北京航空航天大学 | Multi-supercomputing center software system deployment and incremental updating method in wide area environment |
CN111949444A (en) * | 2020-06-24 | 2020-11-17 | 武汉烽火众智数字技术有限责任公司 | Data backup and recovery system and method based on distributed service cluster |
2019
- 2019-11-18: CN application CN201911129032.0A, patent CN110855494B, status Active
Also Published As
Publication number | Publication date |
---|---|
CN110855494A (en) | 2020-02-28 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||