CN118175003A

CN118175003A - Method, device, equipment and medium for real-time round-robin monitoring of mass network equipment

Info

Publication number: CN118175003A
Application number: CN202410371288.7A
Authority: CN
Inventors: 韦东君; 梁骥; 卢惠勤
Original assignee: China Telecom Corp Ltd
Current assignee: China Telecom Corp Ltd
Priority date: 2024-03-28
Filing date: 2024-03-28
Publication date: 2024-06-11

Abstract

The embodiment of the invention provides a method, a device, equipment and a medium for real-time round-robin monitoring of mass network equipment. The method comprises the following steps: periodically acquiring a list of target network devices to be monitored; collecting network monitoring data of the target network devices every first round of polling period; if, for N consecutive collections, the key index data of a device exceed a first preset index threshold or its survivability detection data are greater than a first preset detection threshold, adding the device to a hidden danger device list; collecting network monitoring data of the hidden danger devices every second round of polling period; and if, for M consecutive collections, the key index data of a device equal a second preset index threshold or its survivability detection data are smaller than the first preset detection threshold, deleting the device from the hidden danger device list. By adopting two inspection modes, full round-robin inspection and hidden danger device round-robin inspection, the embodiment supports high-precision comparison and analysis of data from different time periods under fault conditions, improves fault handling efficiency, and optimizes the equipment maintenance plan.

Description

Method, device, equipment and medium for real-time round-robin monitoring of mass network equipment

Technical Field

The present invention relates to the field of network monitoring technologies, and in particular, to a method for real-time round robin monitoring of a mass network device, a device for real-time round robin monitoring of a mass network device, an electronic device, and a storage medium.

Background

As services develop faster and faster, the number of network devices keeps growing, and the push toward cloud migration and digital transformation places ever higher demands on network quality.

At present, faults in access network equipment are found only through alarms dispatched by the network management system. Manual inspection is impractical: there are too many devices, inspection takes too long, there are too many key indexes to check, and human error is likely. This passive operation and maintenance mode for the access network degrades service quality and user perception. Targeted round-robin inspection of abnormal conditions, such as port error codes or port packet loss on off-network equipment, cannot be carried out; and if a full, high-precision data round-robin inspection scheme is adopted, difficulties such as data storage problems easily arise, which in turn leads to slow fault discovery and overly long fault location.

Disclosure of Invention

Aiming at the defects in the prior art, the embodiment of the invention provides a real-time round-robin monitoring method for mass network equipment, a real-time round-robin monitoring device for mass network equipment, electronic equipment and a storage medium.

In a first aspect, an embodiment of the present invention provides a method for real-time round robin monitoring of a mass network device, including:

Periodically acquiring a target network equipment list to be monitored;

Collecting network monitoring data of each target network device in the target network device list every first round of polling period, wherein the network monitoring data comprises key index data and survivability detection data;

If the key index data of the first target network equipment exceeds a first preset index threshold value or the survivability detection data of the first target network equipment is larger than the first preset detection threshold value for N continuous first round of inspection periods, adding the first target network equipment into a hidden danger equipment list;

acquiring network monitoring data of each hidden trouble device in the hidden trouble device list every second round of polling period, wherein the second round of polling period is smaller than the first round of polling period;

And if the key index data of the first hidden danger equipment are equal to a second preset index threshold value or the survivability detection data of the first hidden danger equipment are smaller than the first preset detection threshold value for M continuous second round inspection periods, deleting the first hidden danger equipment from the hidden danger equipment list, wherein M is larger than N, and the second preset index threshold value is smaller than the first preset index threshold value.

As in the above method, optionally, the key indicator data comprises port data and the survivability detection data comprises delay data;

The key index data of the first target network device exceed a first preset index threshold, or the survivability detection data of the first target network device are larger than a first preset detection threshold, including:

the port data of the first target network device exceeds a first preset port threshold value, or the delay data of the first target network device is larger than a first preset delay threshold value;

The key index data of the first hidden danger equipment are equal to a second preset index threshold value, or the survivability detection data of the first hidden danger equipment are smaller than the first preset detection threshold value, including:

the port data of the first hidden danger device are equal to a second preset port threshold value, or the time delay data of the first hidden danger device are smaller than the first preset time delay threshold value.

The method, optionally, collecting network monitoring data of each target network device in the target network device list includes:

Collecting SNMP data of each target network device in the target network device list based on an SNMP monitoring component in a Prometheus monitoring system, wherein the SNMP data comprises port error codes of SNMP data packets;

Acquiring ICMP packet availability detection data of each target network device in the target network device list based on a black box monitoring component in the Prometheus monitoring system;

Correspondingly, the port data of the first target network device exceeds a first preset port threshold, or the delay data of the first target network device is greater than a first preset delay threshold, including:

The port error code increment values of the SNMP data packets of the first target network equipment exceed a first preset port threshold value;

or

And the response time length of the ICMP packet availability detection data of the first target network device is larger than a first preset time delay threshold.

The method optionally further includes, after collecting network monitoring data of each target network device in the target network device list at each first round of polling periods:

And drawing, based on a graphic editor, a visual monitoring group for the network monitoring data of each target network device in the target network device list.

In the above method, optionally, after collecting the network monitoring data of each hidden danger device in the hidden danger device list at every second round of polling period, the method further includes:

And drawing a visual monitoring group aiming at network monitoring data of each hidden danger device in the hidden danger device list based on the graphic editor.

The method, optionally, after collecting the network monitoring data of each target network device in the target network device list at each first round of polling periods, further includes:

and if the response time length of the ICMP packet availability detection data of the second target network equipment exceeds a second preset time delay threshold, generating alarm information aiming at the second target network equipment, wherein the second preset time delay threshold is larger than the first preset time delay threshold.

The method as above, optionally, further comprising:

and after the characteristic processing is carried out on the alarm information, sending the processed alarm information to operation and maintenance personnel.

In a second aspect, an embodiment of the present invention provides a device for real-time round robin monitoring of a mass network device, including:

The acquisition module is used for periodically acquiring a target network equipment list to be monitored;

the first acquisition module is used for acquiring network monitoring data of each target network device in the target network device list every first round of polling period, wherein the network monitoring data comprises key index data and survivability detection data;

The hidden danger equipment screening module is used for adding the first target network equipment into a hidden danger equipment list if the key index data of the first target network equipment exceeds a first preset index threshold value or the survivability detection data of the first target network equipment is larger than the first preset detection threshold value in N continuous first round of inspection periods;

The second acquisition module is used for acquiring network monitoring data of each hidden trouble device in the hidden trouble device list every second round of polling period, wherein the second round of polling period is smaller than the first round of polling period;

And the hidden danger equipment updating module is used for deleting the first hidden danger equipment from the hidden danger equipment list if the key index data of the first hidden danger equipment are equal to a second preset index threshold value or the survivability detection data of the first hidden danger equipment are smaller than the first preset detection threshold value in M continuous second round inspection periods, wherein M is larger than N, and the second preset index threshold value is smaller than the first preset index threshold value.

In a third aspect, an embodiment of the present invention provides an electronic device, including:

The device comprises a memory and a processor, wherein the processor and the memory are communicated with each other through a bus; the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of real-time round robin monitoring of a mass network device as described in any of the first aspects above.

In a fourth aspect, an embodiment of the present invention provides a storage medium having stored thereon a computer program which, when executed by a processor, implements a method for real-time round robin monitoring of a mass network device as described in any of the first aspects above.

The method for real-time round-robin monitoring of the mass network equipment periodically acquires a target network equipment list to be monitored; collecting network monitoring data of each target network device in the target network device list every first round of polling period, wherein the network monitoring data comprises key index data and survivability detection data; if the key index data of the first target network equipment exceeds a first preset index threshold value or the survivability detection data of the first target network equipment is larger than the first preset detection threshold value for N continuous first round of inspection periods, adding the first target network equipment into a hidden danger equipment list; acquiring network monitoring data of each hidden trouble device in the hidden trouble device list every second round of polling period, wherein the second round of polling period is smaller than the first round of polling period; and if the key index data of the first hidden danger equipment are equal to a second preset index threshold value or the survivability detection data of the first hidden danger equipment are smaller than the first preset detection threshold value for M continuous second round inspection periods, deleting the first hidden danger equipment from the hidden danger equipment list, wherein M is larger than N, and the second preset index threshold value is smaller than the first preset index threshold value. 
According to the real-time round inspection method for the mass network equipment, provided by the embodiment of the invention, by adopting two inspection modes of full-quantity round inspection and hidden danger equipment round inspection, the round inspection plan is flexibly adjusted to implement intelligent dynamic customized inspection aiming at the importance, the running condition and the historical data of the equipment, so that the load of a mass and full-quantity monitoring target on a monitoring system can be reduced, more detailed and comprehensive data can be generated aiming at the hidden danger equipment round inspection, the prediction capability of fault occurrence is provided, the high-precision data comparison analysis of different time periods under the fault condition is met, the fault processing efficiency is improved, and the equipment maintenance plan is optimized.

Drawings

FIG. 1 is a flow chart of steps of an embodiment of a method for real-time round robin monitoring of a mass network device of the present invention;

Fig. 2 is a schematic diagram of a Prometheus system in an embodiment of a method for real-time round robin monitoring of a mass network device according to the present invention;

FIG. 3 is a schematic view of a visual interface in an embodiment of a method for real-time round robin monitoring of a mass network device according to the present invention;

Fig. 4 is a schematic flow chart of a Prometheus-based system in an embodiment of a method for real-time round robin monitoring of a mass network device according to the present invention;

FIG. 5 is a flow chart of steps of another embodiment of a method for real-time round robin monitoring of a mass network device of the present invention;

FIG. 6 is a block diagram illustrating an embodiment of an apparatus for real-time round robin monitoring of a mass network device in accordance with the present invention;

Fig. 7 is a block diagram of an embodiment of an electronic device of the present invention.

Detailed Description

In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.

Referring to fig. 1, a step flow chart of an embodiment of a method for real-time round robin monitoring of a mass network device of the present invention is shown, which is applied to a device for real-time round robin monitoring of a mass network device, and specifically may include the following steps:

step S110, periodically acquiring a target network equipment list to be monitored;

Specifically, maintenance of access network equipment is divided among county-level maintenance teams. The equipment is numerous (usually more than 10000 devices), devices are frequently withdrawn from and added to the network, and there is no unified ledger. A conventional monitoring system requires the device list to be updated manually, which is time-consuming and error-prone. The target network device may include an optical line terminal (OLT), a switch, a server, or the like.

For the mass network equipment to be monitored in a given area, the target network device list List1 is generated in batches by a self-written script. For example, a program performs batch ping diagnosis and Simple Network Management Protocol (SNMP) monitoring on a device network segment such as 10.121.0.0/16 and generates a lightweight device list (List1) in the JSON data exchange format. The list is stored in ETCD, a highly available distributed key-value database, which improves the stability and availability of the system.

Then, at every preset update period (for example, 5 days), the target network device list List1 is updated: devices that have been withdrawn from the network are removed and newly added devices are included in a timely manner, which improves data accuracy. To grasp the health state of the network equipment in a timely, accurate, intelligent and comprehensive way and to implement intelligent, dynamic, customized inspection, the target network device list List1 is acquired first. Because List1 is updated periodically, it can also be acquired periodically, and the acquisition period may be the same as the preset update period.
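The list generation and periodic update described above can be sketched as follows. This is a minimal illustration only, not the script of the embodiment: the function names, the JSON layout and the injected `is_reachable` probe (standing in for the batch ping and SNMP checks) are assumptions made for the example, and storage in ETCD is left out.

```python
import ipaddress
import json
from typing import Callable, Iterable


def build_device_list(cidr: str, is_reachable: Callable[[str], bool]) -> str:
    """Return a JSON device list for every reachable host address in `cidr`.

    `is_reachable` is a hypothetical probe supplied by the caller; in practice
    it would wrap a ping and/or an SNMP request.
    """
    network = ipaddress.ip_network(cidr)
    devices = [
        {"address": str(host)}
        for host in network.hosts()
        if is_reachable(str(host))
    ]
    return json.dumps({"list": "List1", "devices": devices}, indent=2)


def diff_device_lists(old: Iterable[str], new: Iterable[str]) -> tuple[set[str], set[str]]:
    """Return (added, removed) addresses between two scans of the segment."""
    old_set, new_set = set(old), set(new)
    return new_set - old_set, old_set - new_set


if __name__ == "__main__":
    # A tiny /30 segment and a fake probe keep the example self-contained.
    print(build_device_list("10.121.0.0/30", lambda ip: ip.endswith(".1")))
```

Running the scan once per update period and applying `diff_device_lists` to the previous and current results gives the devices to add to and remove from List1.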

Step S120, collecting network monitoring data of each target network device in the target network device list every first round of polling period, wherein the network monitoring data comprises key index data and survivability detection data;

Specifically, a first round of polling period is set, for example 5 minutes (300 s). Every 5 minutes, the network monitoring data of each target network device in the target network device list List1 are collected by polling. The network monitoring data may include key index data and survivability detection data: the survivability detection data are used to detect whether the target network device is online, and the key index data are used to detect whether the inspected indexes of the target network device are normal. The first round of polling period is shorter than the preset update period of List1.

Step S130, if the key index data of the first target network device exceeds a first preset index threshold value or the survivability detection data of the first target network device is larger than the first preset detection threshold value for N continuous first round of polling periods, adding the first target network device into a hidden danger device list;

Specifically, if the key index data of a certain target network device (denoted as the first target network device) exceed the first preset index threshold in N consecutive first round of polling period collections, or the survivability detection data of the first target network device are greater than the first preset detection threshold in N consecutive first round of polling period collections, the first target network device is added to the hidden danger device list List2.

For example, if the key index data in the collected network monitoring data of network device A exceed the first preset index threshold 3 consecutive times, or the survivability detection data are greater than the first preset detection threshold 3 consecutive times, network device A is treated as a hidden danger device and added to the hidden danger device list List2.
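The N-consecutive rule of step S130 can be illustrated with a small streak counter. This is a sketch under assumptions: the class and attribute names are hypothetical, and the two conditions are tracked as separate streaks, following the example above in which either condition alone must hold 3 consecutive times.

```python
from collections import defaultdict


class PromotionTracker:
    """Adds a device to List2 after N consecutive abnormal collections."""

    def __init__(self, n: int, index_threshold: float, detect_threshold: float) -> None:
        self.n = n
        self.index_threshold = index_threshold
        self.detect_threshold = detect_threshold
        self.index_streak = defaultdict(int)   # consecutive key index violations
        self.detect_streak = defaultdict(int)  # consecutive detection violations
        self.list2 = set()

    def observe(self, device: str, key_index: float, detect_value: float) -> bool:
        """Record one first-period sample; return True if the device is in List2."""
        self.index_streak[device] = (
            self.index_streak[device] + 1 if key_index > self.index_threshold else 0
        )
        self.detect_streak[device] = (
            self.detect_streak[device] + 1 if detect_value > self.detect_threshold else 0
        )
        if max(self.index_streak[device], self.detect_streak[device]) >= self.n:
            self.list2.add(device)
        return device in self.list2


if __name__ == "__main__":
    tracker = PromotionTracker(n=3, index_threshold=5, detect_threshold=0.1)
    for _ in range(3):
        promoted = tracker.observe("A", key_index=9, detect_value=0.0)
    print(promoted)  # True
```

A single normal sample resets the corresponding streak, so isolated spikes do not move a device into List2.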

Step S140, collecting network monitoring data of each hidden trouble device in the hidden trouble device list every second round of polling period, wherein the second round of polling period is smaller than the first round of polling period;

Specifically, a second round of polling period is set for the hidden danger device list List2, and the second round of polling period is smaller than the first round of polling period, for example 10 s. Every 10 s, the network monitoring data of each hidden danger device in List2 are collected by polling. The network monitoring data may include key index data and survivability detection data: the survivability detection data are used to detect whether the hidden danger device is online, and the key index data are used to detect whether the inspected indexes of the hidden danger device are normal. Different round-robin periods are set for List1 and List2, so that monitoring inspection with different precision is provided.
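The effect of the two periods on collection volume can be checked with simple arithmetic. The sketch below uses the figures given in the description (10000 devices, 300 s and 10 s); the number of 50 hidden danger devices is an assumption chosen only for illustration.

```python
SECONDS_PER_DAY = 86_400


def samples_per_day(devices: int, period_s: int) -> int:
    """Number of collections per day for `devices` polled every `period_s` seconds."""
    return devices * (SECONDS_PER_DAY // period_s)


full_coarse = samples_per_day(10_000, 300)        # all of List1 every 300 s
full_fine = samples_per_day(10_000, 10)           # all devices every 10 s
tiered = full_coarse + samples_per_day(50, 10)    # List1 at 300 s plus 50 devices at 10 s

print(full_coarse, full_fine, tiered)  # 2880000 86400000 3312000
```

Under these assumptions, polling everything at 10 s produces 30 times the data of the 300 s full round, while the two-list scheme adds only 15% on top of the coarse full round.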

Step S150, if the key index data of the first hidden danger device are equal to a second preset index threshold or the survivability detection data of the first hidden danger device are smaller than the first preset detection threshold for M consecutive second round of polling periods, deleting the first hidden danger device from the hidden danger device list, where M is greater than N, and the second preset index threshold is smaller than the first preset index threshold.

Specifically, if key index data of a certain hidden danger device (marked as first hidden danger device) is equal to a second preset index threshold value in a second round of polling period collection process for M times continuously, the first hidden danger device is removed from a hidden danger device List2, or in a second round of polling period collection process for M times continuously, survival detection data of the first hidden danger device is smaller than the first preset detection threshold value, the first hidden danger device is removed from the hidden danger device List2, wherein M is greater than N, and the second preset index threshold value is smaller than the first preset index threshold value.

For example, if the key index data in the collected network monitoring data of hidden danger device A equal the second preset index threshold 10 consecutive times, or the survivability detection data are smaller than the first preset detection threshold 10 consecutive times, hidden danger device A is removed from the hidden danger device list List2. For abnormal situations in network equipment, the inspection plan is flexibly adjusted according to the importance, running condition and historical data of the equipment to implement intelligent, dynamic, customized inspection, which avoids problems such as data storage difficulties caused by excessive system load.

Operation and maintenance personnel can quickly locate hidden danger devices through the hidden danger device list List2, which provides the ability to predict faults. By adopting two inspection modes, full round-robin inspection and hidden danger device round-robin inspection, real-time round-robin monitoring of mass network equipment is realized.
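The removal rule of step S150, together with the constraints that M is greater than N and that the second index threshold is smaller than the first, can be sketched as a check over the most recent samples. The names `Hysteresis` and `should_remove`, and the symbols c1, c2 and t1 for the thresholds, are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Hysteresis:
    n: int      # consecutive abnormal samples needed to enter List2
    m: int      # consecutive normal samples needed to leave List2
    c1: float   # first preset index threshold (entry)
    c2: float   # second preset index threshold (exit)
    t1: float   # first preset detection threshold

    def __post_init__(self) -> None:
        if self.m <= self.n:
            raise ValueError("M must be greater than N")
        if self.c2 >= self.c1:
            raise ValueError("second index threshold must be smaller than the first")


def _last_m_all(values: Sequence[float], m: int, pred: Callable[[float], bool]) -> bool:
    """True if there are at least m samples and the last m all satisfy pred."""
    return len(values) >= m and all(pred(v) for v in values[-m:])


def should_remove(cfg: Hysteresis, key_index: Sequence[float], detect: Sequence[float]) -> bool:
    """Decide whether a hidden danger device can be deleted from List2."""
    index_ok = _last_m_all(key_index, cfg.m, lambda v: v == cfg.c2)
    detect_ok = _last_m_all(detect, cfg.m, lambda v: v < cfg.t1)
    return index_ok or detect_ok


if __name__ == "__main__":
    cfg = Hysteresis(n=3, m=10, c1=5, c2=0, t1=0.1)
    print(should_remove(cfg, [0] * 10, [0.5] * 10))  # True
```

Requiring more samples and a stricter threshold to leave List2 than to enter it keeps a device that hovers around the entry threshold from repeatedly switching between the two lists.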

Further, on the basis of the above embodiment, the key indicator data includes port data, and the survivability detection data includes delay data;

Specifically, the key index data in the network monitoring data may include port data such as port traffic and port error codes, for example the error code increase of a relay port or of a passive optical network (PON) port. The survivability detection data may include delay data such as probe response times, for example ping response data and the response time of Internet Control Message Protocol (ICMP) packet availability detection data.

If the port data of the first target network device exceeds the first preset port threshold value or the delay data of the first target network device is larger than the first preset delay threshold value for N continuous first round of polling periods, adding the first target network device into a hidden danger device List (List 2);

If the port data of the first hidden danger equipment are equal to the second preset port threshold value or the time delay data of the first hidden danger equipment are smaller than the first preset time delay threshold value in M continuous second round inspection periods, deleting the first hidden danger equipment from the hidden danger equipment List 2.

In the embodiment of the invention, targeted round-robin inspection is performed for abnormal conditions such as port error codes or port packet loss on off-network equipment, so that the health state of the network equipment is grasped in a timely, accurate, intelligent and comprehensive manner.
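The port error code increase used as port data is the difference between two successive readings of an error counter. The sketch below assumes a 32-bit SNMP counter (as with the IF-MIB error counters) and treats a smaller current value as a wrap-around; the function name is hypothetical, and a counter reset after a device restart cannot be distinguished from a wrap by this calculation.

```python
def counter_increase(previous: int, current: int, bits: int = 32) -> int:
    """Increase of a monotonically rising counter between two polls.

    If `current` is smaller than `previous`, the counter is assumed to have
    wrapped once at 2**bits.
    """
    if current >= previous:
        return current - previous
    return current + (1 << bits) - previous


if __name__ == "__main__":
    print(counter_increase(10, 25))          # 15
    print(counter_increase(2**32 - 5, 3))    # 8
```

The result for each polling period is the value that would be compared with the first or second preset port threshold.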

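On the basis of the foregoing embodiments, further, the collecting network monitoring data of each target network device in the target network device list includes:

collecting SNMP data of each target network device in the target network device list based on an SNMP monitoring component in a Prometheus monitoring system, wherein the SNMP data comprises port error codes of SNMP data packets;

acquiring ICMP packet availability detection data of each target network device in the target network device list based on a black box monitoring component in the Prometheus monitoring system;

correspondingly, the port error code increment values of the SNMP data packets of the first target network device exceed a first preset port threshold, or the response time length of the ICMP packet availability detection data of the first target network device is larger than a first preset delay threshold.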

Specifically, Prometheus is an open-source service monitoring system and time series database that provides a generic data model and fast interfaces for data collection, storage and query. Fig. 2 is a schematic structural diagram of the Prometheus system in an embodiment of the method for real-time round-robin monitoring of mass network equipment according to the present invention. As shown in Fig. 2, the core component of Prometheus, Prometheus server, periodically pulls data from statically configured monitoring targets or from targets configured automatically through service discovery, and persists the newly pulled data to a storage device. Prometheus provides various monitoring modules and enables rapid deployment of inspection monitoring for cloud network platform components such as IaaS (infrastructure as a service), PaaS (platform as a service) and SaaS (software as a service).

In the Prometheus monitoring system, the SNMP monitoring component snmp_exporter is responsible for collecting SNMP data and provides the key index data of the monitored target network devices. The black box monitoring component blackbox_exporter is responsible for collecting monitoring data over Hypertext Transfer Protocol (HTTP), Domain Name System (DNS), Transmission Control Protocol (TCP) and ICMP. Prometheus server can store the collected monitoring index data in a local disk or a database.

Prometheus periodically pulls index data (metrics) in pull mode through HTTP calls (HTTP/HTTPS requests). Two data collection modes are available: in the first, the provided exporter components are used to periodically pull monitoring index data from active target devices; in the second, the target host reports data to the push component Pushgateway, and Prometheus server pulls the data from Pushgateway in a unified way. An exporter is an index exposer: it collects and aggregates data in the original format from the target application and converts them into formatted data for Prometheus server to call. In the embodiment of the invention, the provided exporter components are used to periodically pull monitoring index data from active target devices, for example to monitor the survivability and the port data of the targets.

When data are stored, Prometheus server dynamically discovers the targets to be monitored through the service discovery component (Service Discovery); in the embodiment of the invention these are the monitoring targets configured for the exporter components, such as the network devices in the target network device list List1 and the hidden danger device list List2. The Retrieval component is responsible for scraping the monitoring index data from the active target hosts, and the collected data are stored on disk through the time series database (TSDB). The data are retained for 15 days by default, and this period can be modified.

In the embodiment of the invention, the Prometheus server module calls the snmp_exporter component and the blackbox_exporter component. The snmp_exporter component collects the port error codes of SNMP data packets of the network devices, and the blackbox_exporter component collects the ICMP packet availability detection data of the network devices, where the network devices include the target network devices in List1 and the hidden danger devices in List2. List1 and List2 can be updated with the Consul tool. Consul is an open source tool written in the Go language for service discovery and configuration in distributed systems. Services are registered and discovered through the HTTP interface provided by Consul, and the target network device list List1 and the hidden danger device list List2 are dynamically updated through Web-based management and configuration.
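One way the list membership could be expressed in Consul is as a service registration whose tag names the list the device belongs to. The sketch below only builds the JSON payload; the service name, ID scheme, tag values and the `poll_tier` metadata key are assumptions for illustration and are not taken from the embodiment. Such a payload would be sent to the Consul agent endpoint `PUT /v1/agent/service/register`, and a device would be withdrawn with `PUT /v1/agent/service/deregister/<service id>`.

```python
import json


def consul_registration(address: str, tier: str) -> dict:
    """Build a Consul service registration payload for one network device.

    `tier` is a hypothetical tag, for example "list1" or "list2", that a
    Prometheus scrape job could filter on to pick its polling period.
    """
    return {
        "ID": f"netdev-{address}",
        "Name": "network-device",
        "Address": address,
        "Tags": [tier],
        "Meta": {"poll_tier": tier},
    }


if __name__ == "__main__":
    print(json.dumps(consul_registration("10.121.0.7", "list2"), indent=2))
```

Under this assumed layout, moving a device into List2 amounts to re-registering the same service ID with the other tag, and two scrape jobs with different intervals can each select one tag through Consul-based service discovery.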

Round-robin rules are set for the first round of polling period and the second round of polling period. The snmp_exporter component collects the port error codes of each target network device in List1 every first round of polling period, and the blackbox_exporter component collects the ICMP packet availability detection data of each target network device in List1 every first round of polling period. Prometheus server obtains, every first round of polling period, the port error codes collected by the snmp_exporter component and the ICMP packet availability detection data collected by the blackbox_exporter component, and stores them in a local disk or a database.

When Prometheus server detects, in the data collected by snmp_exporter and blackbox_exporter, that the response time T of the last N ICMP packet availability detections of a first network device exceeds the first preset delay threshold T1, or that the port error code increment value C of the last N SNMP data packets exceeds the first preset port threshold C1, the hidden danger device list List2 is dynamically updated through the service discovery component of the open source tool Consul, and the first network device is added to List2.
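The condition that the last N response times all exceed T1 could be expressed as a PromQL expression and evaluated through the Prometheus HTTP API (`/api/v1/query`). The sketch below is an assumption-laden illustration: it uses the `probe_duration_seconds` metric exposed by blackbox_exporter, assumes the device address is carried in the `instance` label, and approximates "the last N samples" by a time window of N polling periods, which may contain one sample more or fewer. The host name in the example is a placeholder.

```python
from urllib.parse import urlencode


def all_samples_above(metric: str, instance: str, threshold: float,
                      n: int, period_s: int) -> str:
    """PromQL that is true when every sample in the last n periods exceeds threshold.

    If the minimum over the window exceeds the threshold, every sample does.
    """
    window_s = n * period_s
    return f'min_over_time({metric}{{instance="{instance}"}}[{window_s}s]) > {threshold}'


def instant_query_url(base_url: str, expr: str) -> str:
    """URL of a Prometheus instant query for `expr`."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"


if __name__ == "__main__":
    expr = all_samples_above("probe_duration_seconds", "10.121.0.7", 0.1, n=3, period_s=300)
    print(expr)
    print(instant_query_url("http://prometheus.example:9090", expr))
```

A non-empty query result for a device would then be the trigger for adding it to List2.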

The snmp_exporter component collects the port error codes of each hidden danger device in List2 every second round of polling period, and the blackbox_exporter component collects the ICMP packet availability detection data of each hidden danger device in List2 every second round of polling period. Prometheus server obtains, every second round of polling period, the port error codes collected by the snmp_exporter component and the ICMP packet availability detection data collected by the blackbox_exporter component, and stores the data.

When Prometheus server detects, in the data collected by snmp_exporter and blackbox_exporter, that the response time T of the latest M ICMP packet availability detections of a first hidden danger device is lower than the first preset delay threshold T1, or that the port error code increment value C of the latest M SNMP data packets is equal to the second preset port threshold C2, the hidden danger device list List2 is dynamically updated through the service discovery component of the open source tool Consul, and the first hidden danger device is deleted from List2.

In addition, to ensure data security, data communication between each monitoring component and the monitoring targets adopts an SNMPv3 zero trust mode.

In the method for real-time round-robin monitoring of mass network equipment provided by the embodiment of the invention, the system is built on a cloud platform, so that equipment data can be stored and managed centrally, remote monitoring and operation are facilitated, and management efficiency and scalability are improved. The main programs come from the Internet open source community and feature fast version updates, transparent program code, high availability and customizability, so the monitoring capability can be targeted to complex scenarios. The core services and modules are distributed on the cloud platform and are built with clustered management nodes and the highly available distributed key-value database ETCD, which improves the stability and availability of the system.

On the basis of the foregoing embodiments, further, after collecting the network monitoring data of each target network device in the target network device list at each first round of polling periods, the method further includes:

Specifically, as shown in fig. 2, the HTTP server of the Prometheus server provides interfaces such as PromQL, the query language module provided by Prometheus. The Prometheus web UI, the graphic editor Grafana, and API clients query, analyze, and present the data in the database through the PromQL module exposed by the HTTP server.
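As a hedged illustration of how an API client might address the HTTP server, the following Python sketch builds the URL of a PromQL instant query. The path `/api/v1/query` is the instant-query endpoint of the Prometheus HTTP API and `probe_success` is a metric exposed by blackbox_exporter. The host name and the choice of query are assumptions, since the specification does not name the metrics it uses.

```python
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Build the URL of a Prometheus instant query for the given PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Hypothetical host; the expression selects targets whose last probe failed.
url = instant_query_url("http://prometheus.example:9090", "probe_success == 0")
print(url)
```

The sketch only constructs the request; sending it and parsing the JSON response are left out so that the example runs without a live server.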

In the embodiment of the invention, the Grafana component is used for visual presentation of the inspection data. Grafana is an open-source third-party data visualization tool that mainly provides rich, customizable dashboards and multi-dimensional, multi-combination data displays such as the device online rate and the port error code Top 10. The Prometheus server obtains the data collected by snmp_exporter and blackbox_exporter and then provides the data to the Grafana component, which draws the data according to the drawing requirements. The display can include dynamic data such as the offline rate of network devices, the names of offline devices, and the device port error code Top 10, and provides the ability to review the historical data of the past P hours.
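The two dashboard figures mentioned above, the device offline rate and the port error code Top 10, reduce to simple aggregations. The Python sketch below shows one way to compute them from already collected samples; the function names, the dictionary layout, and the device and port names are hypothetical and are not taken from the specification.

```python
def offline_rate(status):
    """Fraction of devices that are offline; status maps device name to True if online."""
    if not status:
        return 0.0
    offline = sum(1 for online in status.values() if not online)
    return offline / len(status)


def top_error_ports(increments, n=10):
    """Return the n ports with the largest error increment, largest first."""
    ranked = sorted(increments.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


status = {"olt-1": True, "olt-2": False, "olt-3": False, "olt-4": True}
increments = {"olt-1/pon1": 3, "olt-1/pon2": 9, "olt-4/pon1": 0}
print(offline_rate(status))
print(top_error_ports(increments, n=2))
```

In a deployment these aggregations would more likely be expressed as queries evaluated by the monitoring server; the sketch only makes the arithmetic behind the panels explicit.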

Fig. 3 is a schematic view of a visual interface in an embodiment of the real-time round-robin monitoring method for mass network equipment. As shown in fig. 3, the Grafana component obtains the network monitoring data of each network device in List1 and the result corresponding to each index, and displays in real time the current number of offline devices and the number of normal devices, information on the top 20 network devices by PON port error increase, detailed information on the offline devices, the acquisition time, and other information. Multi-index screening and classification are also displayed visually, so that operation and maintenance personnel can check the information conveniently.

The real-time round-robin monitoring method for the mass network equipment provided by the embodiment of the invention realizes real-time visual monitoring of the patrol result based on the graphic editor, can automatically count the patrol result and generate a report, and simultaneously intuitively displays multi-index screening classification, thereby being convenient for notification management.

On the basis of the foregoing embodiments, further, after collecting the network monitoring data of each hidden danger device in the hidden danger device list at every second round of polling period, the method further includes:

Specifically, the Grafana component obtains the network monitoring data of each hidden danger device in List2 and the result corresponding to each index, and displays in real time the current number of offline devices and the number of normal devices, information on the top Q network devices by PON port error increase, detailed information on the offline devices, the acquisition time, and other information. Multi-index screening and classification are also displayed visually, so that operation and maintenance personnel can check the information conveniently. Meanwhile, the Grafana component can draw a trend chart of the ICMP packet detection response time of the hidden danger devices from the network monitoring data collected at different times, show the delay jitter, maximum, minimum, and average values of the hidden danger devices, draw the SNMP port error code increment Top 10 of the hidden danger devices, and so on. Further, since the hidden danger devices in List2 can be updated at intervals of the second polling period, the Grafana component can also obtain the network monitoring data of the hidden danger devices in List2 for each second polling period and dynamically display how the set of hidden danger devices in List2 changes.
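The delay statistics shown for a hidden danger device can be sketched as follows. The specification does not define how jitter is computed, so this Python example assumes the mean absolute difference between consecutive response times; the function name `latency_summary` and the sample values are hypothetical.

```python
from statistics import mean


def latency_summary(samples):
    """Summarize ICMP response times (in milliseconds) of one device."""
    if not samples:
        raise ValueError("at least one sample is required")
    # Assumed jitter definition: mean absolute change between consecutive samples.
    diffs = [abs(later - earlier) for earlier, later in zip(samples, samples[1:])]
    return {
        "max": max(samples),
        "min": min(samples),
        "avg": mean(samples),
        "jitter": mean(diffs) if diffs else 0.0,
    }


print(latency_summary([10.0, 20.0, 10.0, 40.0]))
```

With the 60-second collection interval used for List2 in the later example, one hour of history yields 60 samples per device, which is enough for such a summary to be meaningful.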

On the basis of the foregoing embodiments, further, after collecting the network monitoring data of each target network device in the target network device list in the first round of polling periods at each interval, the method further includes:

Specifically, the blackbox_exporter collects ICMP packet availability detection data of the network devices at intervals of the first polling period. If the response time of the ICMP packet availability detection data of a certain target network device (the second target network device) exceeds a second preset delay threshold T2, where T2 is greater than the first preset delay threshold T1, the second target network device is considered offline, is treated as an unavailable target, and alarm information is generated for it. Fig. 4 is a schematic flow diagram of a method for real-time round-robin monitoring of mass network equipment according to an embodiment of the present invention. As shown in fig. 2 and fig. 4, the alarm notification component Alertmanager can perform alarm operations, including suppression and notification. An alert rule is configured on the Prometheus server; an unavailable target directly triggers this rule, and the Prometheus server pushes the device offline alert to Alertmanager.
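The relationship between the two delay thresholds can be made concrete with a small Python sketch. The function name and the labels "normal", "slow", and "offline" are hypothetical, and treating a missing reply (`None`) as offline is an assumption that the specification does not state.

```python
def classify_probe(response_time, t1, t2):
    """Classify one ICMP probe result against thresholds T1 < T2 (seconds)."""
    if t2 <= t1:
        raise ValueError("T2 must be greater than T1")
    # Assumption: no reply at all is handled like a reply slower than T2.
    if response_time is None or response_time > t2:
        return "offline"   # unavailable target, triggers the offline alarm
    if response_time > t1:
        return "slow"      # counts toward the N consecutive abnormal samples
    return "normal"


for sample in (0.02, 0.3, 2.5, None):
    print(sample, classify_probe(sample, t1=0.1, t2=1.0))
```

Only the "offline" outcome raises an alarm immediately; a "slow" result matters when it repeats over N consecutive polling periods, which is what moves a device into the hidden danger list.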

In practical application, if at least one of the port data of the second hidden danger equipment is not equal to the second preset port threshold value and at least one of the delay data of the second hidden danger equipment is greater than the first preset delay threshold value, alarm information aiming at the second hidden danger equipment is generated.

The method for real-time round-robin monitoring of mass network equipment provided by the embodiment of the invention takes Prometheus as the core and Grafana as the display interface, invokes the snmp_exporter and blackbox_exporter monitoring components, and uses Alertmanager as the alarm component. It adopts two inspection modes, full round-robin inspection and hidden danger device round-robin inspection, to implement intelligent, dynamic, customized inspection, which reduces the load that massive, full-scale monitoring targets place on the monitoring system. The hidden danger device round-robin inspection generates more detailed and comprehensive data and trend analysis, provides the ability to predict fault occurrence, pushes alarm information, and displays the monitoring data visually in real time.

On the basis of the above embodiments, further comprising:

Specifically, the alarm function involving Alertmanager is divided into two separate parts: alarm pushing and alarm notification. Alarm pushing is handled by the Prometheus server, which stores the collected metrics locally, runs the defined alarm rules, periodically evaluates the collected network monitoring data, and pushes an alarm to Alertmanager when an alarm trigger condition is met. Alarm notification is handled by Alertmanager: after receiving an alarm from the Prometheus server, it processes the alarm using features such as deduplication, suppression, and grouping and then sends the notification. As shown in fig. 4, the supported receiving modes include email, DingTalk, WeCom (Enterprise WeChat), and work orders of the electronic operation and maintenance system. After processing, maintenance personnel are notified by mail, QQ, WeChat, short message, and other means of abnormal conditions such as a target device going offline or an increase in port errors.
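The three processing features named above, deduplication, suppression, and grouping, can be illustrated independently of any alerting product. The Python sketch below is a simplified model and not Alertmanager's own algorithm: the alert names `DeviceOffline` and `PortErrorIncrease`, the dictionary layout, and the rule that an offline alarm suppresses the port error alarm of the same device are all assumptions made for illustration.

```python
from collections import defaultdict


def process_alerts(alerts):
    """Deduplicate, suppress, and group alerts; returns alert name -> device list."""
    # Deduplication: keep the first occurrence of each (name, device) pair.
    unique = list(dict.fromkeys((a["alertname"], a["device"]) for a in alerts))
    offline = {device for name, device in unique if name == "DeviceOffline"}
    groups = defaultdict(list)
    for name, device in unique:
        # Suppression (assumed rule): an offline device hides its port error alarm.
        if name == "PortErrorIncrease" and device in offline:
            continue
        # Grouping: one notification per alert name.
        groups[name].append(device)
    return dict(groups)


alerts = [
    {"alertname": "DeviceOffline", "device": "olt-1"},
    {"alertname": "DeviceOffline", "device": "olt-1"},
    {"alertname": "PortErrorIncrease", "device": "olt-1"},
    {"alertname": "PortErrorIncrease", "device": "olt-2"},
]
print(process_alerts(alerts))
```

Grouping by alert name means that a large outage produces one notification listing all affected devices instead of one message per device.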

Fig. 5 is a flowchart illustrating steps of another embodiment of a method for real-time round robin monitoring of a mass network device according to the present invention, where the real-time round robin monitoring process includes:

S51, the Prometheus server calls the target network device list List1, which is generated by periodic batch updates from a custom script;

S52, the Prometheus server defines a round-robin rule with a 5-minute interval, collects the SNMP data and ICMP packet availability detection data of each network device in the target network device list List1, including time-series data such as the number of offline devices, the names of offline devices, and the device port error code Top 10, and stores the data persistently in a local database;

S53, the Grafana component draws visualized monitoring groups for the inspection data corresponding to List1 in step S52, provides dynamic data displays such as the device offline rate, the names of offline devices, and the device port error code Top 10, and provides the ability to review the data of the past 48 hours;

S54, for an unavailable target, the alarm rule of the Prometheus server is triggered directly, and the device offline alarm is pushed to Alertmanager;

S55, when the Prometheus server detects, in the data collected by snmp_exporter and blackbox_exporter, that the response time T of the last 3 ICMP packet availability detection results exceeds the specified delay threshold, or that the port error code increment C of the last 3 SNMP data samples exceeds the specified threshold, the hidden danger device list List2 is dynamically updated and generated through the service discovery component of the open-source tool Consul;

S56, the Prometheus server collects the SNMP data and ICMP packet availability detection data of each hidden danger device in the hidden danger device list List2 every 60 seconds;

S57, Grafana draws visualized monitoring groups for the inspection data corresponding to List2, including: drawing a trend chart of the ICMP packet detection response time of the hidden danger devices, showing the delay jitter, maximum, minimum, and average values of the hidden danger devices, drawing the SNMP port error code increment Top 10 of the hidden danger devices, and the like;

S58, when the Prometheus server detects that the ICMP packet availability detection response time T of a hidden danger device in List2 is lower than the specified delay threshold for 10 consecutive times, or that the port error code increment of the hidden danger device is 0 for 10 consecutive times, the device is intelligently and dynamically removed from the hidden danger device list List2 through the service discovery component of the open-source tool Consul.

In this embodiment, the custom round-robin rule is determined according to the default collection time precision of the SNMP protocol. From the time the system collects abnormal device information until monitoring personnel receive it, the elapsed time is usually longer than 5 minutes, and it can exceed 10 minutes if uncontrollable factors are encountered, which affects fault discovery and handling. A 60-second collection precision is therefore applied to the hidden danger device list List2, which avoids prolonging the service unavailability through additional interference factors.

In addition, the dynamic update condition of the hidden danger device list List2 is set to 10 consecutive low-delay ICMP detection results and 10 consecutive port error code increments of 0. This takes into account cases in which network fluctuation makes the collected data abnormal, such as line aging, loose interfaces, and switching of the transmission wavelength-division system. A record of 10 collections reflects the state of a hidden danger device in List2 without being distorted by such external factors. A collection record that is too short causes frequent list updates, which harms the continuity of the data charts, whereas a collection record that is too long generates unnecessary data and increases the load on the system.
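The figures used in this embodiment (a 5-minute full round, a 60-second hidden danger round, 3 consecutive abnormal samples to enter List2, and 10 consecutive healthy samples to leave it) imply the time spans computed in the Python sketch below. The constant and function names are hypothetical; only the numeric values come from the text.

```python
FULL_PERIOD_S = 300    # first polling period: 5 minutes
HIDDEN_PERIOD_S = 60   # second polling period: 60 seconds
N_PROMOTE = 3          # consecutive abnormal samples needed to enter List2
M_REMOVE = 10          # consecutive healthy samples needed to leave List2


def consecutive_tail(flags):
    """Length of the run of True values at the end of the sequence."""
    count = 0
    for flag in reversed(flags):
        if not flag:
            break
        count += 1
    return count


def should_promote(abnormal_flags, n=N_PROMOTE):
    """True when the latest n full-round samples were all abnormal."""
    return consecutive_tail(abnormal_flags) >= n


# Time spanned by the polling periods that each decision is based on.
promote_span_s = N_PROMOTE * FULL_PERIOD_S   # 900 s, i.e. 15 minutes
remove_span_s = M_REMOVE * HIDDEN_PERIOD_S   # 600 s, i.e. 10 minutes

print(promote_span_s, remove_span_s)
print(should_promote([False, True, True, True]))
```

Under these values a device is observed for about 10 minutes at the finer interval before it can leave List2, which is consistent with the stated aim of avoiding frequent list updates.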

The method for real-time round-robin monitoring of mass network equipment provided by the embodiment of the invention realizes automatic inspection of access network devices, covering the number of online devices, device offline conditions, port error increases, and the like. Inspecting a thousand access network devices, which required manual patrol in the prior art, can be completed in only a few seconds, thereby saving labor cost. In addition, the method can be rapidly deployed and applied to automatic inspection monitoring of private cloud platforms and private networks, and solves problems of current private cloud operation and maintenance inspection such as low working efficiency, complicated inspection results, long fault location time, and non-intuitive presentation of inspection results. By building the system on a cloud platform, centralized storage and management of device data can be achieved, remote monitoring and operation are facilitated, and management efficiency and scalability are improved.

It should be noted that, for simplicity of description, the method embodiments are shown as a series of acts, but it should be understood by those skilled in the art that the embodiments are not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred embodiments, and that the acts are not necessarily required by the embodiments of the invention.

Referring to fig. 6, a block diagram of an embodiment of an apparatus for real-time round robin monitoring of a mass network device according to the present invention may specifically include the following modules:

an obtaining module 610, configured to periodically obtain a list of target network devices to be monitored;

A first collection module 620, configured to collect network monitoring data of each target network device in the target network device list every first round of polling periods, where the network monitoring data includes key index data and survivability detection data;

The hidden danger device screening module 630 is configured to add the first target network device to a hidden danger device list if the key index data of the first target network device exceeds a first preset index threshold or the survivability detection data of the first target network device is greater than a first preset detection threshold for N consecutive first rounds of polling periods;

the second collection module 640 is configured to collect network monitoring data of each hidden danger device in the hidden danger device list every second round of polling period, where the second round of polling period is smaller than the first round of polling period;

The hidden danger equipment updating module 650 is configured to delete the first hidden danger equipment from the hidden danger equipment list if the key index data of the first hidden danger equipment is equal to a second preset index threshold value or the survivability detection data of the first hidden danger equipment is smaller than the first preset detection threshold value in M consecutive second round of inspection periods, where M is greater than N, and the second preset index threshold value is smaller than the first preset index threshold value.

As in the above apparatus, optionally, the key indicator data comprises port data and the survivability detection data comprises delay data;

In the above apparatus, optionally, the first acquisition module is specifically configured to:

Or (b)

The apparatus as above, optionally, further comprising:

And the visualization module is used for drawing visualized monitoring groups for the network monitoring data of each target network device in the target network device list based on the graphic editor after collecting the network monitoring data of each target network device in the target network device list every first round of polling period.

In the above device, optionally, the visualization module is further configured to draw visualized monitoring groups for the network monitoring data of each hidden danger device in the hidden danger device list based on the graphic editor after the network monitoring data of each hidden danger device in the hidden danger device list are collected every second round of polling period.

The apparatus as above, optionally, further comprising:

And the alarm module is used for generating alarm information aiming at the second target network equipment if the response time length of the ICMP packet availability detection data of the second target network equipment exceeds a second preset time delay threshold after the network monitoring data of each target network equipment in the target network equipment list are acquired every first round of polling period, wherein the second preset time delay threshold is larger than the first preset time delay threshold.

In the above device, optionally, the alarm module is specifically configured to perform feature processing on the alarm information and then send the processed alarm information to operation and maintenance personnel.

For the device embodiment, since the device embodiment is substantially similar to the method embodiment, the description is relatively simple, and the relevant points only need to be referred to the part of the description of the method embodiment, which is not repeated herein.

Referring to fig. 7, there is shown a block diagram of an embodiment of an electronic device of the present invention, the device comprising: a processor (processor) 710, a memory (memory) 720, and a bus 730;

wherein processor 710 and memory 720 communicate with each other via bus 730;

Processor 710 is configured to invoke program instructions in memory 720 to perform the methods provided by the method embodiments described above, including, for example: periodically acquiring a target network equipment list to be monitored; collecting network monitoring data of each target network device in the target network device list every first round of polling period, wherein the network monitoring data comprises key index data and survivability detection data; if the key index data of the first target network equipment exceeds a first preset index threshold value or the survivability detection data of the first target network equipment is larger than the first preset detection threshold value for N continuous first round of inspection periods, adding the first target network equipment into a hidden danger equipment list; acquiring network monitoring data of each hidden trouble device in the hidden trouble device list every second round of polling period, wherein the second round of polling period is smaller than the first round of polling period; and if the key index data of the first hidden danger equipment are equal to a second preset index threshold value or the survivability detection data of the first hidden danger equipment are smaller than the first preset detection threshold value for M continuous second round inspection periods, deleting the first hidden danger equipment from the hidden danger equipment list, wherein M is larger than N, and the second preset index threshold value is smaller than the first preset index threshold value.

Embodiments of the present invention disclose a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the methods provided by the method embodiments described above, for example comprising: periodically acquiring a target network equipment list to be monitored; collecting network monitoring data of each target network device in the target network device list every first round of polling period, wherein the network monitoring data comprises key index data and survivability detection data; if the key index data of the first target network equipment exceeds a first preset index threshold value or the survivability detection data of the first target network equipment is larger than the first preset detection threshold value for N continuous first round of inspection periods, adding the first target network equipment into a hidden danger equipment list; acquiring network monitoring data of each hidden trouble device in the hidden trouble device list every second round of polling period, wherein the second round of polling period is smaller than the first round of polling period; and if the key index data of the first hidden danger equipment are equal to a second preset index threshold value or the survivability detection data of the first hidden danger equipment are smaller than the first preset detection threshold value for M continuous second round inspection periods, deleting the first hidden danger equipment from the hidden danger equipment list, wherein M is larger than N, and the second preset index threshold value is smaller than the first preset index threshold value.

Embodiments of the present invention provide a non-transitory computer readable storage medium storing computer instructions that cause a computer to perform the methods provided by the above-described method embodiments, for example, including: periodically acquiring a target network equipment list to be monitored; collecting network monitoring data of each target network device in the target network device list every first round of polling period, wherein the network monitoring data comprises key index data and survivability detection data; if the key index data of the first target network equipment exceeds a first preset index threshold value or the survivability detection data of the first target network equipment is larger than the first preset detection threshold value for N continuous first round of inspection periods, adding the first target network equipment into a hidden danger equipment list; acquiring network monitoring data of each hidden trouble device in the hidden trouble device list every second round of polling period, wherein the second round of polling period is smaller than the first round of polling period; and if the key index data of the first hidden danger equipment are equal to a second preset index threshold value or the survivability detection data of the first hidden danger equipment are smaller than the first preset detection threshold value for M continuous second round inspection periods, deleting the first hidden danger equipment from the hidden danger equipment list, wherein M is larger than N, and the second preset index threshold value is smaller than the first preset index threshold value.

In this specification, the embodiments are described in a progressive manner, each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another.

It will be apparent to those skilled in the art that embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the invention may take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.

Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the scope of the embodiments of the invention.

Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or terminal device that comprises the element.

The above detailed description of the method for real-time round robin monitoring of a mass network device, the real-time round robin monitoring device of a mass network device, an electronic device and a storage medium provided by the invention applies specific examples to illustrate the principle and implementation of the invention, and the above description of the examples is only used for helping to understand the method and core ideas of the invention; meanwhile, as those skilled in the art will have variations in the specific embodiments and application scope in accordance with the ideas of the present invention, the present description should not be construed as limiting the present invention in view of the above.

Claims

1. A real-time round-robin monitoring method for mass network equipment, characterized by comprising the following steps:

Periodically acquiring a target network equipment list to be monitored;

2. The method of claim 1, wherein the key indicator data comprises port data and the survivability detection data comprises delay data;

3. The method of claim 2, wherein the collecting network monitoring data for each target network device in the list of target network devices comprises:

Or (b)

4. The method of claim 3, further comprising, after collecting network monitoring data for each target network device in the list of target network devices every first round of polling cycles:

5. The method of claim 4, further comprising, after collecting network monitoring data for each hidden trouble device in the list of hidden trouble devices every second round of polling cycles:

6. The method of claim 3, wherein after collecting the network monitoring data of each target network device in the target network device list every first round of polling periods, further comprising:

7. The method as recited in claim 6, further comprising:

8. A real-time round-robin monitoring device for mass network equipment, characterized by comprising:

9. An electronic device, comprising:

the device comprises a memory and a processor, wherein the processor and the memory are communicated with each other through a bus; the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of any of claims 1-7.

10. A computer readable storage medium, on which a computer program is stored, which computer program, when being executed by a processor, implements the method according to any one of claims 1 to 7.