CN110855494A - Method for realizing high availability of agent based on distributed monitoring system - Google Patents

Info

Publication number
CN110855494A
Authority
CN
China
Prior art keywords
server
data
monitoring system
standby
abnormal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911129032.0A
Other languages
Chinese (zh)
Other versions
CN110855494B (en)
Inventor
程永新
宋辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai New Torch Network Information Technology Ltd By Share Ltd
Original Assignee
Shanghai New Torch Network Information Technology Ltd By Share Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai New Torch Network Information Technology Ltd By Share Ltd
Priority to CN201911129032.0A
Publication of CN110855494A
Application granted
Publication of CN110855494B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0663 Performing the actions predefined by failover planning, e.g. switching to standby network elements
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/0805 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability
    • H04L43/0811 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability by checking connectivity
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/0805 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability
    • H04L43/0817 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability by checking functioning
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1095 Replication or mirroring of data, e.g. scheduling or transport for data synchronisation between network nodes
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/56 Provisioning of proxy services

Abstract

The invention discloses a method for realizing high availability of an agent based on a distributed monitoring system, comprising the following steps: S1: deploying a standby server in the distributed monitoring system; S2: the central server periodically monitors each component of the distributed monitoring system for abnormal states; S3: when a component serving as a main server is detected to be abnormal, the standby server is started and replaces the abnormal component as the main server; S4: when the abnormal component returns to an available state, the standby server submits its synchronized data to the recovered component, so that the component self-heals. By setting up a standby server, the invention switches between the main server and the standby server through state checks and context data synchronization; the distributed monitoring system continues to operate stably and reliably when a single point of failure occurs; no manual intervention is needed, which improves operation and maintenance efficiency; and high availability of the entire system is achieved with a single standby server or a small number of them.

Description

Method for realizing high availability of agent based on distributed monitoring system
Technical Field
The invention relates to a high availability method of a monitoring system, in particular to a method for realizing high availability of an agent based on a distributed monitoring system.
Background
Existing distributed monitoring systems are generally designed to achieve load balancing and to solve the problem of network isolation. In a traditional monitoring system, the central server has to perform a series of related tasks such as high-concurrency data acquisition, data processing, and data access. As the monitoring scale grows, the central server reaches a performance bottleneck, and expanding its capacity becomes less and less cost-effective.
By deploying a distributed monitoring system, the logical stages of data acquisition, data processing, and data access are separated: the central server is responsible for data processing, the data server for data access, and the proxy server for data acquisition. This structure relieves monitoring pressure cost-effectively and makes capacity expansion much more convenient: whenever a performance bottleneck appears, the component responsible can be located and expanded without affecting the other components.
Another advantage of the distributed monitoring system is that it addresses network isolation. For network security, modern network planning typically isolates the service network from the management network. If a monitoring system deployed in the management network needs to access an IT network element in the service network, special policies and routes often have to be opened for it. In a large network, the work required to provision such ad hoc access is enormous, it becomes increasingly difficult to maintain as the network changes, and the large number of policy and routing entries can also degrade network device performance. By deploying an agent monitoring server in the service network, setting it as a trusted host, and giving it an exit connected to the management network, the configuration becomes much simpler and easier to maintain.
The prior technical scheme has the following problem: the components of the distributed monitoring system do not implement high availability, so the failure of a single machine seriously affects monitoring coverage.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a method for realizing high availability of the agent based on a distributed monitoring system, and the high availability is realized by setting a standby server.
The technical scheme adopted by the invention to solve this technical problem is a method for realizing high availability of the agent based on a distributed monitoring system, which comprises the following steps:
S1: deploying a standby server in a distributed monitoring system, wherein the distributed monitoring system comprises a central server, a data server and a plurality of proxy servers; the central server initially selects a data server and a plurality of proxy servers as main servers;
S2: the central server periodically monitors the data server and the proxy server components for abnormal states;
S3: when any data server or proxy server component is detected to be abnormal, the data of the abnormal component is stored, the standby server is started, and the standby server replaces the abnormal component as the main server;
S4: when the abnormal component returns to an available state, the standby server submits the synchronized data to the recovered component, so that the component self-heals, and the standby server returns to the standby state.
Further, the central server is connected to the data server and the proxy server to perform state detection in real time; the standby server is connected to the data server and the proxy server.
Further, the central server periodically detects the states of the components of the distributed monitoring system through a system timer.
Further, step S2 specifically includes: deploying a detection program in the central server and judging whether the proxy server works normally through IP reachability detection and connectivity detection of the TCP service interface: an ICMP operation checks whether the network path from the central server to the proxy server is reachable; a socket operation checks whether port 10051 on the proxy server can be connected from the central server; if either of the two checks fails, the proxy server is confirmed to be in an abnormal state.
Further, step S3 specifically includes: S21: the central server stores the environment data of the abnormal proxy server at the time of switching; S22: the central server synchronizes the environment data to the standby server and outputs a switching log; S23: the changeover switch in the central server is turned on, the standby proxy server is switched to be the main server and takes over all service processing of the original abnormal proxy server, and the service data generated by the standby server is marked.
Further, the environment data includes the running configuration parameters required to start the current service, low-frequency change data generated during operation, and memory snapshots of key service data structures.
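As an illustration only, the three parts of the environment data could be held in one serializable structure so that the central server can store it and later synchronize it to another server. The class name `EnvironmentData`, its field names, and the choice of JSON are assumptions made for this sketch; the source does not specify a data format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict


@dataclass
class EnvironmentData:
    """Hypothetical container for the three kinds of environment data."""

    # Running configuration parameters required to start the current service.
    run_config: Dict[str, Any] = field(default_factory=dict)
    # Low-frequency change data generated during operation.
    low_frequency_data: Dict[str, Any] = field(default_factory=dict)
    # Memory snapshots of key service data structures.
    memory_snapshot: Dict[str, Any] = field(default_factory=dict)

    def to_json(self) -> str:
        """Serialize so the data can be stored or sent to another server."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "EnvironmentData":
        """Rebuild the structure on the receiving side."""
        return cls(**json.loads(text))
```

A round trip through `to_json` and `from_json` yields an equal object, which is the property the synchronization steps rely on in this sketch.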
Further, step S4 specifically includes: S31: the central server confirms, through abnormal state monitoring, that the original abnormal proxy server has returned to an available state; S32: the environment data reported and stored at the time of switching is merged, according to time factors, with the marked data generated while the standby server was running; S33: the central server stops the standby server and synchronizes the merged data to the recovered proxy server; S34: the proxy server is started according to the running configuration parameters in the environment data that are required to start the current service, and is switched to the main server state.
Compared with the prior art, the invention has the following beneficial effects: the method sets up a standby server and switches between the main server and the standby server through state checks and context data synchronization, thereby making the distributed monitoring system highly available; the system continues to operate stably and reliably when a single point of failure occurs; the automated scheme needs no manual intervention, which greatly reduces the operation and maintenance workload and improves efficiency; and the implementation is cost-effective, because a single standby server or a small number of them is enough to make the whole system highly available.
Drawings
Fig. 1 is a flowchart of a method for implementing high availability of agents based on a distributed monitoring system according to an embodiment of the present invention;
Fig. 2 is an architecture diagram of a distributed monitoring system according to an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the figures and examples.
Fig. 1 is a flowchart of a method for implementing high availability of an agent based on a distributed monitoring system in an embodiment of the present invention.
Referring to Fig. 1, a method for implementing high availability of an agent based on a distributed monitoring system according to an embodiment of the present invention includes the following steps:
S1: deploying a standby server in a distributed monitoring system, wherein the distributed monitoring system comprises a central server, a data server and a plurality of proxy servers; the central server initially selects a data server and a plurality of proxy servers as main servers;
S2: the central server periodically monitors the data server and the proxy server components for abnormal states;
S3: when any data server or proxy server component is detected to be abnormal, the data of the abnormal component is stored, the standby server is started, and the standby server replaces the abnormal component as the main server;
S4: when the abnormal component returns to an available state, the standby server submits the synchronized data to the recovered component, so that the component self-heals, and the standby server returns to the standby state.
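The role changes in steps S3 and S4 can be pictured as a small state transition evaluated after each monitoring round. The sketch below is one possible reading, not the patented implementation; the names `Role` and `next_roles` are hypothetical and do not come from the source.

```python
from enum import Enum
from typing import Tuple


class Role(Enum):
    MAIN = "main"          # component currently serving traffic
    STANDBY = "standby"    # standby server waiting to take over
    ABNORMAL = "abnormal"  # component detected as failed


def next_roles(component: Role, standby: Role, healthy: bool) -> Tuple[Role, Role]:
    """Return the (component, standby) roles after one monitoring round."""
    if component is Role.MAIN and not healthy:
        # S3: the standby server replaces the abnormal component as main server.
        return Role.ABNORMAL, Role.MAIN
    if component is Role.ABNORMAL and healthy:
        # S4: the recovered component resumes as main server and the
        # standby server returns to the standby state.
        return Role.MAIN, Role.STANDBY
    # No change: either everything is healthy or the failure persists.
    return component, standby
```

Data storage and synchronization, which the steps also require, are deliberately left out here so that only the switching logic is shown.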
Referring to Fig. 2, the distributed monitoring system in this embodiment includes a central server, together with a data server and a plurality of proxy servers that serve as main servers. The central server is connected to the data server and the proxy servers to perform state detection in real time, and the standby server is likewise connected to the data server and the proxy servers. The central server periodically detects the state of every component of the distributed monitoring system through a system timer.
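A timer-driven detection loop of this kind might look like the following sketch. The source only says that a system timer is used; the 30-second interval, the function names `poll_components` and `monitor`, and the injectable `sleep` parameter are assumptions added for illustration and testability.

```python
import time
from typing import Callable, Dict, Iterable, Optional


def poll_components(components: Iterable[str],
                    is_healthy: Callable[[str], bool]) -> Dict[str, bool]:
    """Run one detection round and return the health of every component."""
    return {name: is_healthy(name) for name in components}


def monitor(components: Iterable[str],
            is_healthy: Callable[[str], bool],
            on_abnormal: Callable[[str], None],
            interval_s: float = 30.0,          # assumed interval, not from the source
            rounds: Optional[int] = None,      # None means run until interrupted
            sleep: Callable[[float], None] = time.sleep) -> None:
    """Check all components every interval_s seconds and report failures."""
    names = list(components)
    done = 0
    while rounds is None or done < rounds:
        for name, ok in poll_components(names, is_healthy).items():
            if not ok:
                on_abnormal(name)
        done += 1
        if rounds is None or done < rounds:
            sleep(interval_s)
```

Passing the health check and the failure handler as callables keeps the timer loop independent of how a component is probed and of what the failover does.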
Specifically, in the method for implementing high availability of an agent based on a distributed monitoring system in the embodiment of the present invention, step S2 specifically includes: deploying a detection program in the central server and judging whether the proxy server works normally through IP reachability detection and connectivity detection of the TCP service interface: an ICMP (Internet Control Message Protocol) operation checks whether the network path from the central server to the proxy server is reachable; a socket operation checks whether port 10051 on the proxy server can be connected from the central server; if either of the two checks fails, the proxy server is confirmed to be in an abnormal state.
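The two checks can be sketched as follows. Port 10051 comes from the text; everything else is an assumption: the function names are hypothetical, the timeouts are arbitrary, and the ICMP check shells out to the operating system's `ping` command because raw ICMP sockets normally require elevated privileges.

```python
import platform
import socket
import subprocess
from typing import Callable


def icmp_reachable(host: str, timeout_s: float = 3.0) -> bool:
    """Send one ICMP echo request using the system ping command."""
    count_flag = "-n" if platform.system() == "Windows" else "-c"
    try:
        result = subprocess.run(
            ["ping", count_flag, "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def tcp_port_open(host: str, port: int = 10051, timeout_s: float = 3.0) -> bool:
    """Try to open a TCP connection to the proxy's service port."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def proxy_is_abnormal(host: str,
                      icmp_check: Callable[[str], bool] = icmp_reachable,
                      tcp_check: Callable[[str], bool] = tcp_port_open) -> bool:
    """The proxy is abnormal if either of the two checks fails."""
    return not (icmp_check(host) and tcp_check(host))
```

Because the rule is "either check fails", a proxy whose host answers ping but whose service port is closed is still treated as abnormal.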
Specifically, in the method for implementing high availability of the agent based on the distributed monitoring system in the embodiment of the present invention, step S3 specifically includes:
S21: the central server stores the environment data of the abnormal proxy server at the time of switching, wherein the environment data comprises the running configuration parameters required to start the current service, low-frequency change data generated during operation, and memory snapshots of key service data structures;
S22: the central server synchronizes the environment data to the standby server and outputs a switching log;
S23: the changeover switch in the central server is turned on, the standby proxy server is switched to be the main server and takes over all service processing of the original abnormal proxy server, and the service data generated by the standby server is marked.
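The switching sub-steps S21 to S23 can be sketched in a few lines. This is a simplified in-memory model under stated assumptions: the `StandbyServer` class, the `fail_over` function, the `from_standby` marker field, and the use of plain dictionaries for environment data are all hypothetical and stand in for whatever storage and transport a real deployment uses.

```python
import copy
import logging
import time
from typing import Any, Dict, List, Optional

log = logging.getLogger("failover")


class StandbyServer:
    """Hypothetical standby proxy holding only the state this sketch needs."""

    def __init__(self) -> None:
        self.environment: Optional[Dict[str, Any]] = None
        self.is_main = False
        self.records: List[Dict[str, Any]] = []

    def collect(self, key: str, value: Any, ts: Optional[float] = None) -> Dict[str, Any]:
        """Store one piece of service data, marked as produced by the standby."""
        record = {
            "key": key,
            "value": value,
            "ts": time.time() if ts is None else ts,
            "from_standby": True,  # the mark required by S23
        }
        self.records.append(record)
        return record


def fail_over(proxy_name: str,
              environment: Dict[str, Any],
              store: Dict[str, Dict[str, Any]],
              standby: StandbyServer) -> None:
    """Switch service processing from an abnormal proxy to the standby."""
    # S21: the central server stores the environment data at switching time.
    store[proxy_name] = copy.deepcopy(environment)
    # S22: synchronize the environment data to the standby and log the switch.
    standby.environment = copy.deepcopy(store[proxy_name])
    log.info("switching %s to the standby server", proxy_name)
    # S23: turn the switch on; the standby now acts as the main server.
    standby.is_main = True
```

Marking each record at collection time is what later allows the standby's data to be told apart from the data saved at the moment of switching.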
Specifically, in the method for realizing high availability of the agent based on the distributed monitoring system in the embodiment of the present invention, step S4 specifically includes:
S31: the central server confirms, through abnormal state monitoring, that the original abnormal proxy server has returned to an available state;
S32: the environment data reported and stored at the time of switching is merged, according to time factors, with the marked data generated while the standby server was running;
S33: the central server stops the standby server and synchronizes the merged data to the recovered proxy server;
S34: the proxy server is started according to the running configuration parameters in the environment data that are required to start the current service, and is switched to the main server state.
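The source does not define how "time factors" drive the merge. One plausible reading, sketched below, is that each record carries a timestamp and the newest record per key wins. The record layout (`key`, `value`, `ts`), the functions `merge_by_time` and `fail_back`, and the dictionary-based server state are assumptions for illustration only.

```python
from typing import Any, Dict, Iterable

Record = Dict[str, Any]


def merge_by_time(saved: Iterable[Record], marked: Iterable[Record]) -> Dict[str, Record]:
    """Keep, for every key, the record with the latest timestamp.

    On equal timestamps the standby's marked record wins, because it is
    processed after the saved one.
    """
    merged: Dict[str, Record] = {}
    for record in list(saved) + list(marked):
        current = merged.get(record["key"])
        if current is None or record["ts"] >= current["ts"]:
            merged[record["key"]] = record
    return merged


def fail_back(saved: Iterable[Record],
              marked: Iterable[Record],
              standby_state: Dict[str, Any],
              proxy_state: Dict[str, Any]) -> Dict[str, Record]:
    """Hand service back to a proxy already confirmed as available (S31)."""
    # S32: merge the saved environment data with the standby's marked data.
    merged = merge_by_time(saved, marked)
    # S33: stop the standby and synchronize the merged data to the proxy.
    standby_state["active"] = False
    proxy_state["data"] = merged
    # S34: start the proxy and switch it to the main server state.
    proxy_state["active"] = True
    return merged
```

Other merge policies, such as appending all records in time order, would also fit the wording of S32; the last-writer-wins rule here is only the simplest one to show.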
In summary, the method for realizing high availability of the agent based on the distributed monitoring system provided by the invention sets up a standby server and makes the distributed monitoring system highly available through state checks and context data synchronization; the system continues to operate stably and reliably when a single point of failure occurs; the automated scheme needs no manual intervention, which greatly reduces the operation and maintenance workload and improves efficiency; and the implementation is cost-effective, because a single standby server or a small number of them is enough to make the whole system highly available.
Although the present invention has been described with respect to the preferred embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (7)

1. A method for realizing high availability of an agent based on a distributed monitoring system is characterized by comprising the following steps:
S1: deploying a standby server in a distributed monitoring system, wherein the distributed monitoring system comprises a central server, a data server and a plurality of proxy servers; the central server initially selects a data server and a plurality of proxy servers as main servers;
S2: the central server periodically monitors the data server and the proxy server components for abnormal states;
S3: when the central server detects that any data server or proxy server component is abnormal, the data of the abnormal component is stored, the standby server is started, and the standby server replaces the abnormal component as the main server;
S4: when the abnormal component returns to an available state, the standby server submits the synchronized data to the recovered component, so that the component self-heals, and the standby server returns to the standby state.
2. The method for realizing high availability of the agent based on the distributed monitoring system as claimed in claim 1, wherein the central server is connected to the data server and the proxy server for real-time state detection; the standby server is connected to the data server and the proxy server.
3. The method for realizing high availability of agents based on the distributed monitoring system as claimed in claim 1, wherein the central server periodically detects the status of each component of the distributed monitoring system through a system timer.
4. The method for realizing high availability of agents based on the distributed monitoring system according to claim 1, wherein step S2 specifically includes: deploying a detection program in the central server and judging whether the proxy server works normally through IP reachability detection and connectivity detection of the TCP service interface: an ICMP operation checks whether the network path from the central server to the proxy server is reachable; a socket operation checks whether port 10051 on the proxy server can be connected from the central server; if either of the two checks fails, the proxy server is confirmed to be in an abnormal state.
5. The method for realizing high availability of agents based on the distributed monitoring system according to claim 1, wherein the step S3 specifically includes:
S21: the central server stores the environment data of the abnormal proxy server at the time of switching;
S22: the central server synchronizes the environment data to the standby server and outputs a switching log;
S23: the changeover switch in the central server is turned on, the standby proxy server is switched to be the main server and takes over all service processing of the original abnormal proxy server, and the service data generated by the standby server is marked.
6. The method for achieving high agent availability based on the distributed monitoring system of claim 5, wherein the environment data comprises running configuration parameters required by current service startup, low-frequency change data generated in running and memory snapshots of key business data structures.
7. The method for realizing high availability of agents based on the distributed monitoring system according to claim 6, wherein the step S4 specifically includes:
S31: the central server confirms, through abnormal state monitoring, that the original abnormal proxy server has returned to an available state;
S32: the environment data reported and stored at the time of switching is merged, according to time factors, with the marked data generated while the standby server was running;
S33: the central server stops the standby server and synchronizes the merged data to the recovered proxy server;
S34: the proxy server is started according to the running configuration parameters in the environment data that are required to start the current service, and is switched to the main server state.
CN201911129032.0A 2019-11-18 2019-11-18 Method for realizing high availability of agent based on distributed monitoring system Active CN110855494B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911129032.0A CN110855494B (en) 2019-11-18 2019-11-18 Method for realizing high availability of agent based on distributed monitoring system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911129032.0A CN110855494B (en) 2019-11-18 2019-11-18 Method for realizing high availability of agent based on distributed monitoring system

Publications (2)

Publication Number Publication Date
CN110855494A (en) 2020-02-28
CN110855494B (en) 2022-10-04

Family

ID=69601993

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911129032.0A Active CN110855494B (en) 2019-11-18 2019-11-18 Method for realizing high availability of agent based on distributed monitoring system

Country Status (1)

Country Link
CN (1) CN110855494B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101996111A (en) * 2010-11-30 2011-03-30 华为技术有限公司 Switching method, device and distributed blade server system
CN102231681A (en) * 2011-06-27 2011-11-02 中国建设银行股份有限公司 High availability cluster computer system and fault treatment method thereof
CN102662751A (en) * 2012-03-30 2012-09-12 浪潮电子信息产业股份有限公司 Method for improving availability of virtual machine system based on thermomigration
CN105554106A (en) * 2015-12-15 2016-05-04 上海仪电(集团)有限公司 Memcache distributed caching system
CN105677516A (en) * 2016-01-07 2016-06-15 成都市思叠科技有限公司 Method for efficient and reliable backup recovery in calculation approach storage cloud platform
CN106945691A (en) * 2017-04-10 2017-07-14 湖南中车时代通信信号有限公司 The real-time hot standby switch device of server multicenter of automatic train monitor
CN111131146A (en) * 2019-11-08 2020-05-08 北京航空航天大学 Multi-supercomputing center software system deployment and incremental updating method in wide area environment
CN111949444A (en) * 2020-06-24 2020-11-17 武汉烽火众智数字技术有限责任公司 Data backup and recovery system and method based on distributed service cluster

Also Published As

Publication number Publication date
CN110855494B (en) 2022-10-04

Similar Documents

Publication Publication Date Title
CN110224871B (en) High-availability method and device for Redis cluster
Obadia et al. Failover mechanisms for distributed SDN controllers
CN101588304B (en) Implementation method of VRRP and device
CN100387017C (en) High usable self-healing Logic box fault detecting and tolerating method for constituting multi-machine system
US20190081853A1 (en) Link Handover Method for Service in Storage System, and Storage Device
KR20070026327A (en) Redundant routing capabilities for a network node cluster
USRE45454E1 (en) Dual-homing layer 2 switch
CN103780407A (en) Gateway dynamic switching method and apparatus in distributed resilient network interconnection (DRNI)
CN105515812A (en) Fault processing method of resources and device
CN104378232A (en) Schizencephaly finding and recovering method and device under main joint and auxiliary joint cluster networking mode
US20150113313A1 (en) Method of operating a server system with high availability
CN103490914A (en) Switching system and switching method for multi-machine hot standby of network application equipment
CN105429799A (en) Server backup method and device
CN110971462A (en) Equipment switching method, device, equipment and storage medium
CN113489149B (en) Power grid monitoring system service master node selection method based on real-time state sensing
CN103220189A (en) Multi-active detection (MAD) backup method and equipment
CN110855494B (en) Method for realizing high availability of agent based on distributed monitoring system
CN104780067A (en) Method and device for rebooting PE (port extender)
CN110677288A (en) Edge computing system and method generally used for multi-scene deployment
CN102546313B (en) Multi-activation detection method and multi-activation detection device
CN109101372A (en) Redundancy switching method, storage medium and the Shelf Management Module of Shelf Management Module
CA3214690A1 (en) Passive optical network for utility infrastructure resiliency
CN112131201B (en) Method, system, equipment and medium for high availability of network additional storage
CN106130783B (en) Port fault processing method and device
CN105871524B (en) A kind of method and system based on TIPC protocol realization two-node cluster hot backup

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant