JP4412031B2 - Network monitoring system and method, and program - Google Patents

Info

Publication number
JP4412031B2
JP4412031B2 JP2004101827A JP2004101827A JP4412031B2 JP 4412031 B2 JP4412031 B2 JP 4412031B2 JP 2004101827 A JP2004101827 A JP 2004101827A JP 2004101827 A JP2004101827 A JP 2004101827A JP 4412031 B2 JP4412031 B2 JP 4412031B2
Authority
JP
Japan
Prior art keywords
monitoring information
monitoring
information
network
collected
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
JP2004101827A
Other languages
Japanese (ja)
Other versions
JP2005285040A (en)
Inventor
伸治 加美
到 西岡
Original Assignee
日本電気株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電気株式会社
Priority to JP2004101827A
Publication of JP2005285040A
Application granted
Publication of JP4412031B2
Active legal-status
Anticipated expiration

Description

  The present invention relates to a network monitoring system, a method thereof, and a program, and more particularly to a failure monitoring method and a failure information analysis method in a communication network.

  With the recent advancement of the information society, servers providing various services operate around the clock in data centers and similar facilities, and a huge number of network devices of various types have been introduced to connect these servers. A failure of these network devices not only inconveniences service users but also causes the service provider a huge loss. The administrator therefore needs to monitor the network devices constantly using a monitoring device. If a monitored network device fails, the administrator needs to identify the cause of the failure and recover from it quickly.

  A common form of network device monitoring uses SNMP (Simple Network Management Protocol). In this form, monitoring information is collected either by polling, which periodically collects the operating state of the device, or by traps, in which the device raises an alarm when a threshold set in advance on the device side is exceeded. When a failure occurs, the administrator needs to identify the cause of the failure and analyze its scope of influence based on the information collected by the monitoring device using these two collection methods, and this analysis takes an enormous amount of time.

  To solve this problem, Patent Document 1 discloses a technique for automatically analyzing failure information. In this technique, fuzzy rules are applied to multiple pieces of information collected from network devices; when it is determined that a failure has occurred and which portion has failed, the failed part is then diagnosed in detail.

  However, as the devices themselves become more complex and networks grow in scale, monitoring network devices in detail makes the amount of monitoring information to be collected enormous, and collecting it places a load on the network itself. Conversely, if an attempt is made to reduce the load on the network, the amount of monitoring information must be reduced, which makes it difficult for the administrator to grasp the network state in detail.

  To solve this problem, Patent Document 2 discloses a method in which only monitoring information limited in advance is collected and, when the determination on that monitoring information indicates an abnormality, the monitoring information associated with it in advance is collected and judged in turn, and this operation is repeated. As another solution, Patent Document 3 discloses a method of collecting monitoring information by polling preferentially from devices that have failed frequently in the past.

  The techniques of Patent Documents 2 and 3 monitor intensively only the network device that has failed and the item in which the failure occurred, so they can reduce the load on the network. However, because they start operating only after the failure has occurred, information related to the failure may not be obtained, and the cause of the failure may not be analyzable. Moreover, the problem that the administrator has to perform the analysis manually remains unsolved.

Patent Document 1: JP 7-30540 A
Patent Document 2: Japanese Patent Laid-Open No. 8-066302
Patent Document 3: JP-A-4-239242

  The problem with the three prior arts described above is that, because they operate only after a failure has occurred, monitoring information may not be obtainable from the network device in which the failure has occurred. For example, when the failure is that data traffic places a very heavy load on a network device, the device cannot respond to requests for monitoring information because of that load, even if collection from the device is attempted. As another example, when a network device restarts for some reason, the information from before the restart is lost, so the administrator cannot obtain enough information to analyze the reason for the restart.

  An object of the present invention is to provide a network monitoring system, a method thereof, and a program for acquiring related information before a network device fails, without imposing a load on the network device.

  Another object of the present invention is to provide a network monitoring system, method and program for notifying the administrator of the cause of failure and the analysis result of the failure influence range at the same time in the information collecting process.

The network monitoring system according to the present invention comprises:
A monitoring system for collecting and monitoring information on a plurality of network devices,
Monitoring rule storage means for storing in advance initial monitoring information to be collected from each of the network devices and related monitoring information as monitoring rules;
Sign detection means for detecting a sign of failure by processing initial monitoring information collected from the network device;
Collected monitoring information determining means for, in response to detection of the sign by the sign detection means, searching the monitoring rule storage means for monitoring information that is related to the initial monitoring information and identifies the cause of the failure, and collecting the retrieved monitoring information;
A post-discovery means for performing failure details determination processing based on the monitoring information collected by the collected monitoring information determining means;
It is characterized by including.

A network monitoring method according to the present invention includes:
A monitoring method for collecting and monitoring information of a plurality of network devices,
Prepare a monitoring rule storage means that stores in advance the initial monitoring information to be collected from each of the network devices and the monitoring information related thereto as monitoring rules;
A sign detection step of detecting a sign of failure by processing the initial monitoring information collected from the network device;
A collected monitoring information determination step of, in response to detection of the sign in the sign detection step, searching the monitoring rule storage means for monitoring information that is related to the initial monitoring information and identifies the cause of the failure, and collecting the retrieved monitoring information;
A post-discovery step for performing failure details determination processing based on the monitoring information collected by the collected monitoring information determination step;
It is characterized by including.

The program according to the present invention is:
A program for causing a computer to execute a monitoring method for collecting and monitoring information of a plurality of network devices,
Processing to detect a failure sign by processing initial monitoring information collected from the network device; and
A process of searching the monitoring rule storage means for monitoring information that identifies the cause of the failure in relation to the initial monitoring information in response to the sign discovery, and collecting the searched monitoring information;
A post-discovery process for determining a failure detail based on the related monitoring information;
It is characterized by including.

  The operation of the present invention will be described. In a network monitoring system having a communication function for acquiring monitoring information from a plurality of network devices, the monitoring information collecting unit collects continuous-quantity information as initial monitoring information, and the monitoring information determining unit statistically analyzes that information. If behavior different from normal is detected, a sign of an abnormality is considered to have been detected, and the monitoring rule database is referred to in order to determine and collect the related monitoring information. The cause of the failure is then identified by judging the collected values in the monitoring information determination unit.

  The first effect of the present invention is that the load placed on the network devices and the network when the network monitoring system monitors the network devices is minimized. The reason is that not all monitoring information is fetched from the network devices at once; instead, the minimum monitoring information related to the alarm that has occurred is determined, and only the monitoring information selected by that determination is collected.

  The second effect of the present invention is that the network monitoring system can find a network failure quickly. The reason is that the network monitoring system detects a sign of failure and then dynamically starts detailed monitoring of the failure related to that sign. Because monitoring of the related information starts dynamically based on the sign, less information is monitored at any one time. Whereas monitoring all parameters has so far required a monitoring interval of about 30 minutes, the present invention can shorten the monitoring interval to about 1 minute with the same load as before.

  A third effect of the present invention is that a network administrator can quickly cope with a network failure. The reason is that, in the network management system of the present invention, after the detection of a sign or failure, the cause of the failure is identified and the influence range is inspected, and the result is reported to the network administrator.

  Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings. In the present invention, it is assumed that SNMP (Simple Network Management Protocol) standardized by the Internet Engineering Task Force (IETF) is used as means for collecting information by the information collection unit 101 in FIG. In the description of the present invention, the network monitoring system is an apparatus, and the administrator represents a person who manages the network using the network monitoring system.

  FIG. 1 is a block diagram showing a network monitoring system in a first embodiment of the present invention and a monitored network monitored using the network monitoring system of the present invention. In FIG. 1, the network monitoring system 100 includes a monitoring information collecting unit 101 that collects monitoring information from a plurality of network devices 111; a monitoring information determination unit 102 that determines whether the collected monitoring information is abnormal, using a determination function defined in advance in the determination function unit 103; a monitoring rule DB (database) 105 that defines monitoring rules; a collected monitoring information determination unit 104 that determines the monitoring information to be collected next with reference to the monitoring rule DB 105; and a log storage unit 106 that stores the information collected by the monitoring system and the presence or absence of an alarm.

  The information stored in the log storage unit 106 can be accessed by an administrator who monitors the network through the monitoring terminal 121 in the monitoring site 120. When abnormality information is written to the log storage unit 106, the monitoring terminal 121 is notified automatically.

  As shown in FIG. 2, the information in the monitoring rule DB 105 is composed of a plurality of monitoring objects. Each monitoring object describes a monitoring information name that identifies the information monitored by the administrator, an MIB (Management Information Base) object name that the monitoring information collecting unit 101 uses to collect the monitoring information by SNMP, a monitoring tree number indicating the relationship between pieces of monitoring information, a monitoring node address indicating the network device to be monitored, a timeout time indicating the monitoring time, a determination value used for judging the collected information, and a child monitoring tree number indicating the monitoring information to be monitored next.

  The monitoring tree number of each piece of monitoring information is delimited by “.” (dot), as in “1.1” and “1.1.1.1”, so that child monitoring information to be monitored next, if any, can be associated with it. This monitoring tree is constructed in advance by the administrator based on experience of failures that have occurred so far, and is stored in the monitoring rule DB 105.
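
  As an illustration only, the monitoring objects of FIG. 2 and their dot-delimited tree numbers could be represented as in the following Python sketch. The class name, field names, addresses, MIB object choices, and determination values are assumptions made for this example and do not appear in the specification; only the kinds of fields follow the description above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class MonitoringObject:
    # Hypothetical field names mirroring the items described for FIG. 2.
    name: str            # monitoring information name
    mib_object: str      # MIB object name collected via SNMP
    tree_number: str     # monitoring tree number, e.g. "1", "1.1", "1.1.1"
    node_address: str    # address of the monitored network device
    timeout_s: float     # timeout time
    threshold: float     # determination value
    children: List[str] = field(default_factory=list)  # child tree numbers


def parent_of(tree_number: str) -> Optional[str]:
    """Return the parent tree number, or None for a root such as "1"."""
    head, sep, _ = tree_number.rpartition(".")
    return head if sep else None


def children_of(rules: Dict[str, MonitoringObject],
                tree_number: str) -> List[MonitoringObject]:
    """Return the monitoring objects to be collected next, if any."""
    return [rules[c] for c in rules[tree_number].children if c in rules]


# Illustrative contents only; the values are not taken from the specification.
rules = {
    "1": MonitoringObject("a", "ifOutOctets", "1", "192.0.2.1", 5.0, 3.0, ["1.1"]),
    "1.1": MonitoringObject("b", "ifOutDiscards", "1.1", "192.0.2.1", 5.0, 0.0, ["1.1.1"]),
    "1.1.1": MonitoringObject("c", "ipRouteTable", "1.1.1", "192.0.2.1", 5.0, 0.0),
}
```

  In this sketch the child tree numbers are stored explicitly, as in the description of FIG. 2, and `parent_of` recovers the parent from the dot-delimited number itself.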

  The determination function unit 103 that analyzes the collected monitoring information includes a time-series information determination function 103a, a multiple time-series information determination function 103b, an integer type information determination function 103c, and an array type information determination function 103d. The method for selecting among these determination functions is described below.

  Because the monitoring information collected by SNMP follows SMI (Structure of Management Information), the notation format for MIBs, the data types of the monitoring information monitored in the present invention are Counter (a non-negative integer that increases with time), Gauge (a non-negative integer that holds at its maximum value), Integer (an integer value), IP Address (an IP address), and Physical Address (a physical address, for example a MAC address), together with two other data types: List (a list in which a plurality of values are arranged) and Table (a list in which a plurality of lists are arranged).

  The determination function is selected according to these types. If the data is a single Counter or Gauge, the time-series information determination function 103a is selected. If it is Counters and Gauges collected from a plurality of network devices, the multiple time-series information determination function 103b is selected. If it is a single Integer, IP Address, or Physical Address, the integer type information determination function 103c is selected. If it is a List or Table of Integers, IP Addresses, or Physical Addresses, or such data collected from a plurality of network devices, the array type information determination function 103d is selected.
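
  The selection rule can be sketched as a small dispatch function. This is a minimal Python illustration, assuming that each collected value is tagged with one of the type names listed above; the function name and the returned labels are hypothetical, and the labels follow the numbering 103a to 103d given for the determination function unit 103.

```python
from typing import Sequence

# Type names as listed in the description; grouping them is an assumption.
CONTINUOUS = {"Counter", "Gauge"}
DISCRETE = {"Integer", "IP Address", "Physical Address"}
COMPOSITE = {"List", "Table"}


def select_determination_function(data_types: Sequence[str]) -> str:
    """Pick a determination function label from the collected data types."""
    if not data_types:
        raise ValueError("no monitoring information supplied")
    kinds = set(data_types)
    if kinds <= CONTINUOUS:
        # One series -> time-series function, several -> multiple time-series.
        return "103a" if len(data_types) == 1 else "103b"
    if kinds <= DISCRETE and len(data_types) == 1:
        return "103c"  # single Integer, IP Address, or Physical Address
    if kinds <= DISCRETE | COMPOSITE:
        return "103d"  # lists, tables, or values from several devices
    raise ValueError(f"unsupported combination of data types: {sorted(kinds)}")
```

  For example, `select_determination_function(["Counter"])` yields the label for the time-series function, while a routing table collected as `["Table"]` yields the label for the array type function.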

  Hereinafter, a determination method for each data type of the monitoring information will be described. When the input monitoring information is a single Counter or Gauge, the network status is diagnosed according to the operation flow shown in FIG. 4 using the time-series information determination function shown in FIG. That is, the time-series information determination function statistically processes past data, compares the statistically processed data with new data, calculates the magnitude of the outlier, and determines whether there is an abnormality. The monitoring information A_t collected from the monitoring information collecting unit 101 is stored in the monitoring information DB 10 over the storage period W (S10), and time-series information A[t] is created. The statistical processing device 11 then statistically processes the time-series information A[t] to derive the occurrence distribution function θ (S11).

  The abnormality determination device 12 compares the new monitoring information A_{t+1} with the distribution function θ, calculates the difference between A_{t+1} and the distribution function θ (S12), and compares this difference with the error condition in the monitoring rule DB 105 (S13). If the result is true (abnormal), the collected monitoring information determination unit 104 is notified of the abnormality and the abnormality is stored in the log storage unit 106 (S14). If the result is false (normal), the collected monitoring information determination unit 104 is notified of normality and, at the same time, the monitoring information DB 10 discards the oldest stored information A_{t-W} and stores A_{t+1} (S15).
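
  Steps S10 to S15 can be sketched as a sliding-window outlier check. The specification does not state the form of the distribution function θ, so the Python sketch below assumes that θ is summarized by the mean and standard deviation of the stored window and that the error condition is a threshold expressed in standard deviations; the class and parameter names are hypothetical. For a Counter, the values passed in would normally be per-interval increments rather than raw counter readings.

```python
from collections import deque
from statistics import mean, pstdev


class TimeSeriesDetector:
    """Sliding-window check: window = storage period W, threshold = error condition."""

    def __init__(self, window: int, threshold: float):
        self.history = deque(maxlen=window)
        self.threshold = threshold

    def check(self, value: float) -> bool:
        """Return True if the new value is judged abnormal."""
        if len(self.history) < self.history.maxlen:
            self.history.append(value)  # S10: still filling the storage period
            return False
        mu = mean(self.history)         # S11: summarize the stored series
        sigma = pstdev(self.history)
        if sigma > 0:
            deviation = abs(value - mu) / sigma  # S12: size of the outlier
        else:
            deviation = 0.0 if value == mu else float("inf")
        if deviation > self.threshold:  # S13: compare with the error condition
            return True                 # S14: abnormal, stored window unchanged
        self.history.append(value)      # S15: oldest sample is dropped by the deque
        return False
```

  As in S15, only values judged normal replace the oldest stored sample, so an outlier does not distort the baseline used for the next check.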

  When the input monitoring information is a plurality of Counters or Gauges, the network status is diagnosed according to the operation flow shown in FIG. 6 using the multiple time-series information determination function shown in FIG. The multiple time-series information determination function performs correlation processing on a plurality of past data series, statistically processes the resulting correlation data, compares the correlation of the new data with the statistically processed data, calculates the magnitude of the outlier, and determines whether there is an abnormality. A plurality of pieces of monitoring information A_t, B_t, and C_t collected from the monitoring information collecting unit 101 are stored in the monitoring information DB 10 over the storage period W (S20), and time-series information A[t], B[t], and C[t] is created (S21).

  In the correlation processing device 11, the time-series information A[t], B[t], and C[t] is subjected to correlation processing to derive the covariances Γ_AB, Γ_BC, and Γ_CA (S22). The occurrence distribution functions θ_AB, θ_BC, and θ_CA of these covariances are then derived (S23). The abnormality determination device 12 calculates the covariances of the new monitoring information A_{t+1}, B_{t+1}, and C_{t+1} (S24), compares each covariance with the corresponding distribution function θ, and calculates the difference between the new data and the distribution function θ (S25).

  This difference is compared with the error condition in the monitoring rule DB 105 (S26). If the result is true (abnormal), the collected monitoring information determination unit 104 is notified of the abnormality and the abnormality is stored in the log storage unit 106 (S27). If the result is false (normal), the collected monitoring information determination unit 104 is notified of normality. At the same time, the monitoring information DB 10 discards the oldest stored information A_{t-W}, B_{t-W}, and C_{t-W} and stores A_{t+1}, B_{t+1}, and C_{t+1} (S28).
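
  The covariance-based check of S22 to S26 can be illustrated as follows. The specification does not say how the distribution of each covariance is obtained, so this Python sketch makes one possible assumption: covariances are computed over sliding sub-windows of length k, their mean and standard deviation stand in for θ, and the covariance of the most recent k samples (including the new one) is compared against them. All function and parameter names are hypothetical.

```python
from itertools import combinations
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def covariance(x: List[float], y: List[float]) -> float:
    """Population covariance of two equally long series."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)


def rolling_covariances(x: List[float], y: List[float], k: int) -> List[float]:
    """Covariances over every sub-window of length k (assumed stand-in for S22)."""
    return [covariance(x[i:i + k], y[i:i + k]) for i in range(len(x) - k + 1)]


def covariance_outliers(series: Dict[str, List[float]], new: Dict[str, float],
                        k: int, threshold: float) -> List[Tuple[str, str]]:
    """Return the pairs whose newest covariance deviates from its history."""
    if k < 2:
        raise ValueError("k must be at least 2")
    abnormal = []
    for a, b in combinations(sorted(series), 2):
        past = rolling_covariances(series[a], series[b], k)
        mu, sigma = mean(past), pstdev(past)              # S23 (assumed form)
        latest = covariance(series[a][-(k - 1):] + [new[a]],
                            series[b][-(k - 1):] + [new[b]])  # S24
        if sigma > 0:
            deviation = abs(latest - mu) / sigma          # S25
        else:
            deviation = 0.0 if latest == mu else float("inf")
        if deviation > threshold:                         # S26
            abnormal.append((a, b))
    return abnormal
```

  A pair is reported when the relationship between two series changes, even if each series on its own stays within its usual range.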

  When the input monitoring information is a single Integer, IP Address, or Physical Address, the network state is diagnosed according to the operation flow shown in FIG. 8 using the integer type information determination function shown in FIG. The integer type information determination function determines whether or not the value of the collected monitoring information is normal. The monitoring information A collected from the monitoring information collecting unit 101 is compared with the error condition in the monitoring rule DB 105 (S30, S31). If the result is true (abnormal), the collected monitoring information determination unit 104 is notified of the abnormality and the abnormality is stored in the log storage unit 106 (S32). If the result is false (normal), the collected monitoring information determination unit 104 is notified of normality (S33).

  When the input monitoring information is a plurality of IP Addresses or Physical Addresses, the network status is diagnosed according to the operation flow shown in FIG. 10 using the array type information determination function shown in FIG. The array type information determination function determines whether or not the logical connections (for example, an IP routing table or an L2 forwarding table) in the monitoring information collected from a plurality of network devices are normal. The tables of monitoring information A[x], B[x], and C[x] collected from the monitoring information collecting unit 101 are combined by the table combining device 14, as shown in FIG., into one table (binding table Ω) in which the transfer destination at each network device is arranged for each destination (S40). The abnormality determination device 12 refers to the configuration information DB 13 and inspects the route for each destination (S41). Inspecting the routes makes it possible to find a loop or a destination for which a device has no route.

  The method for inspecting the routes of an IP routing table will be described as an example with reference to FIG. For the route to Dest1 in the binding table Ω, network device A forwards to interface A-1. Referring to the configuration information DB 13, interface A-1 belongs to network device A itself, so this route is determined to be normal. Next, network device B forwards the route to Dest1 to interface A-2. Referring to the configuration information DB 13, interface A-2 is an interface of network device A, and network device A has already been inspected for Dest1, so this route is also determined to be normal.

  Next, network device C forwards the route to Dest1 to network device A, as network device B does, so this route is also determined to be normal, and all routes to Dest1 are determined to be normal.

  For Dest2 in the binding table Ω, network device A forwards the packet to interface B-2. The configuration information DB 13 is then searched for the network device that has interface B-2, which is found to be network device B. In the binding table Ω, network device B forwards Dest2 to interface B-1, and since interface B-1 belongs to the same network device B, the route is determined to be normal. Next, network device C is checked for Dest2 in the binding table Ω.

  Since network device C has no route to Dest2, a “no route” error is detected. This error information is held until the inspection of all routes is completed. For Dest3 in the binding table Ω, network device A forwards the packet to interface C-3.

  Next, the configuration information DB 13 is searched for the network device that has interface C-3, which is found to be network device C. In the binding table Ω, network device C forwards Dest3 to interface A-3, and interface A-3 is found to belong to network device A. Since network device A has already been checked on the route to Dest3, a loop error is detected on this route. This error information is held until the inspection of all routes is completed (S42).

  The inspection then moves to network device B, which has not yet been inspected. Network device B forwards the packet to interface C-2, and the configuration information DB 13 shows that interface C-2 belongs to network device C. Since a loop error has already been detected at network device C for Dest3, the route inspection ends here. When the route inspection is completed (S43), the collected monitoring information determination unit 104 is notified of the abnormality, including the errors detected during the inspection, and the abnormality information is stored in the log storage unit 106 (S44).
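
  The route inspection walked through above can be sketched in Python as follows. The dictionary layout of the binding table Ω and of the configuration information DB, the function name, and the status labels are assumptions for illustration; the sample entries follow the Dest1 to Dest3 walkthrough, except that the interface used by network device C for Dest1 is not named in the text and is assumed here to be A-2.

```python
from typing import Dict, Optional

OK, NO_ROUTE, LOOP = "ok", "no route", "loop"


def inspect_routes(table: Dict[str, Dict[str, Optional[str]]],
                   owner: Dict[str, str]) -> Dict[str, Dict[str, str]]:
    """For each destination, classify the route seen from every device."""
    result = {}
    for dest, next_if in table.items():
        status: Dict[str, str] = {}
        for start in next_if:
            path = []
            device = start
            while True:
                if device in status:       # already inspected for this destination
                    outcome = status[device]
                    break
                if device in path:         # revisited on the current path
                    outcome = LOOP
                    break
                path.append(device)
                out_if = next_if.get(device)
                if out_if is None:         # device has no entry for the destination
                    outcome = NO_ROUTE
                    break
                nxt = owner[out_if]        # look up the owner in the configuration DB
                if nxt == device:          # forwarded to one of its own interfaces
                    outcome = OK
                    break
                device = nxt
            for d in path:
                status[d] = outcome
        result[dest] = status
    return result


# Configuration information: which device owns each interface.
owner = {"A-1": "A", "A-2": "A", "A-3": "A",
         "B-1": "B", "B-2": "B", "C-2": "C", "C-3": "C"}
# Binding table: destination -> device -> outgoing interface (None = no route).
binding_table = {
    "Dest1": {"A": "A-1", "B": "A-2", "C": "A-2"},  # C's interface is assumed
    "Dest2": {"A": "B-2", "B": "B-1", "C": None},
    "Dest3": {"A": "C-3", "B": "C-2", "C": "A-3"},
}
report = inspect_routes(binding_table, owner)
```

  Errors are collected per destination and returned together, which corresponds to holding the error information until the inspection of all routes is completed.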

  In this description, the route inspection of the IP routing table has been described as an example, but this method can be similarly applied to the route inspection of the MAC forwarding table such as Ethernet (registered trademark).

  Next, how to combine these four determination functions will be described. FIG. 12 is a table describing the properties of the four determination functions 103a to 103d. The time-series information determination function and the multiple time-series information determination function are classified as pre-discovery type means, and the integer type information determination function and the array type information determination function are classified as post-discovery type means. The pre-discovery type means performs statistical processing and correlation processing on the monitoring information and detects a pattern that has not appeared before as a sign of abnormality. Because it detects signs, the monitoring system can detect an abnormality in advance; on the other hand, because the abnormality may never actually occur afterwards, its detection accuracy is low.

  In contrast, the post-discovery type means makes its determination using real-time monitoring information from the network device and detects an abnormality. The monitoring system therefore detects the abnormality only after it has actually occurred in the network device, but the detection accuracy is high. Given these characteristics, arranging the pre-discovery means upstream in the tree structure of the rule DB and the post-discovery means downstream allows signs of various types of failure to be found quickly, and allows it to be confirmed quickly whether a sign really is a failure.

  Hereinafter, the operation of the network monitoring system that monitors the state of the network by combining the pre-discovery means and the post-discovery means will be described. FIG. 13 is a flowchart showing an operation procedure of the network monitoring system 100 shown in FIG. First, the procedure by which monitoring information is collected and judged step by step, based on an event that has occurred in a network device, will be described with reference to FIGS. 1 and 13.

  The monitoring information collection unit 101 reads the initial monitoring information (the monitoring information whose collection starts first in the monitoring rule DB) from the monitoring rule DB 105 (S200), and collects the monitoring information a (see FIG. 2) by SNMP polling at the interval specified in the monitoring rule DB (S201).

  The monitoring information collected from the network device 111 is passed to the monitoring information determination unit 102, which selects an appropriate determination function from the determination function unit 103 based on the data type of the monitoring information and makes a determination (S202). Here, the appropriate determination function is selected based on the data type indicated by the MIB object name shown in FIG. The response of the monitoring information determination unit is compared with the determination value in the monitoring rule DB; if it is smaller than the determination value, the state is determined to be normal, and if it is larger, abnormal (S203).

  If there is an abnormality, the collected monitoring information determination unit 104 refers to the rule DB, looks up the child monitoring tree number of the abnormal monitoring information a, and notifies the monitoring information collection unit to start collecting the monitoring information b with child monitoring tree number 1.1 (see FIG. 2) (S205). Upon receiving the notification, the monitoring information collection unit 101 collects the monitoring information b at the interval specified in the monitoring rule DB (S201), and the determination of this monitoring information is then repeated in the same procedure. Each piece of monitoring information holds an alarm state; while the alarm state of the parent monitoring information a is the error state, the monitoring information continues to be collected and judged. During this time, even if the value of the monitoring information becomes false when compared with the determination value in the monitoring rule DB, the alarm state remains the error state.

  Next, the procedure for releasing an alarm that has occurred will be described with reference to FIGS. Alarm release starts when the state of the network changes, and with it the determination result for the monitoring information being monitored, either because the network administrator has dealt with the problem or because the self-healing function of the network has handled it.

  In S203, when the response of the monitoring information determination unit 102 is normal, the collected monitoring information determination unit 104 determines whether the monitoring information in the lowest layer among the monitoring information being monitored is the initial monitoring information (S204). If it is the initial monitoring information (that is, the monitoring information a in FIG. 2), no alarm has occurred, so the collected monitoring information determination unit 104 does nothing. If, in S204, the monitoring information in the lowest layer is not the initial monitoring information (that is, it is the monitoring information b or the monitoring information c in FIG. 2), the monitoring information collecting unit is notified to stop collecting the monitoring information currently being monitored (S206).

  Next, if the parent of the monitoring information being monitored is in the alarm state, the determination result of the monitoring information determination unit 102 for that direct parent is monitored (S207). If the determination result is true (abnormal), it continues to be monitored until it becomes false (normal) (S207). If the determination result is false (normal), it is determined whether or not that monitoring information is the initial monitoring information (S204), and the operations from S206 onward are repeated until the lowest layer of monitoring information being monitored is the initial monitoring information, that is, until all alarms have been released.

  After all alarms are released, only the initial monitoring information is monitored. If an abnormality occurs again in the initial monitoring information, the same operation is repeated. With this alarm release operation, even if an abnormality is detected by the prior discovery means, it is possible to return to the initial state and continue the normal monitoring operation.
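
  The escalation (S205) and alarm release (S204, S206, S207) procedure can be sketched as a small state machine. This Python sketch is one possible reading of the flow, simplified to a tree in which each piece of monitoring information has at most one child; the class name, the `releasing` flag, and the callback signature are assumptions for illustration and are not part of the specification.

```python
from typing import Callable, Dict, List, Optional


class MonitoringSession:
    """Tracks which monitoring information is being collected, root first."""

    def __init__(self, children: Dict[str, Optional[str]], root: str):
        self.children = children   # tree number -> child tree number (or None)
        self.active = [root]       # the initial monitoring information
        self.releasing = False     # True while alarms are being released

    def step(self, is_abnormal: Callable[[str], bool]) -> List[str]:
        """Run one collect-and-judge cycle on the lowest active layer."""
        deepest = self.active[-1]
        abnormal = is_abnormal(deepest)            # S202/S203
        if abnormal and not self.releasing:
            child = self.children.get(deepest)
            if child is not None:
                self.active.append(child)          # S205: start collecting the child
        elif not abnormal and len(self.active) > 1:
            self.active.pop()                      # S206: stop collecting this layer
            # S207: wait on the parent unless only the initial information remains
            self.releasing = len(self.active) > 1
        return list(self.active)
```

  While `releasing` is set, an abnormal parent is simply polled again until it turns normal, which corresponds to the waiting loop at S207; once only the initial monitoring information remains, a new abnormality starts the escalation again.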

  In the embodiment of the present invention described above, dynamic collection of monitoring information using the monitoring rule DB minimizes the load that monitoring information collection places on the network. Placing the pre-discovery means upstream in the tree structure of the monitoring rule DB allows the network administrator to find signs of failure quickly. Placing the post-discovery means downstream in the tree structure of the monitoring rule DB allows the network administrator to determine immediately what caused the failure and how far its influence extends.

  In the embodiment of the present invention, the case where the prior discovery means is arranged upstream in the tree structure of the monitoring rule DB and the post-discovery means is arranged downstream has been described. However, the present invention is not limited to this, and the monitoring rule DB can be constructed in an arbitrary form.

  Next, examples of the present invention will be described. The description below explains in detail how the monitoring rule DB 105 is constructed when a network administrator monitors failures, and how it is used in operation. FIG. 14 is a diagram showing the network configuration used in the first, second, and third embodiments of the present invention. As shown in FIG. 14, the network consists of routers R1 to R3 together with clients H1 and H2, streaming servers H3 and H4, and a hub HUB, which belong to the local networks L1 to L3. The links connecting these devices are assumed to be Fast Ethernet (registered trademark) at 100 Mbit/s.

  The objects monitored by the network monitoring apparatus 100 are the network devices of the routers R1 to R3. The network monitoring device 100 is connected to the router R2, and can reach other routers via the router R2.

  The first embodiment of the present invention will be described below with reference to the drawings. FIG. 15 is a diagram illustrating the tree that describes how the pieces of monitoring information in the monitoring rule DB are connected in the network monitoring apparatus 100 according to the first embodiment. As shown in FIG. 15, the procedure in the first embodiment is to detect a sudden increase in traffic (sign detection), to monitor whether a packet drop failure accompanies it (failure discovery), and, if a failure has occurred, to identify which route (interface) the offending traffic arrives on (cause identification).

  As the initial monitoring information, the network monitoring apparatus 100 acquires the MIB information ifOutOctets (M1, M2, M3), which is the output traffic volume of each router's interface to its local network, and monitors it using the time-series information determination function. Here, it is assumed that while a stream is being delivered from the streaming server H3 to the client H1 at 20 Mbit/s, the streaming server H4 starts delivering a stream at 60 Mbit/s. When delivery from the streaming server H4 starts, a sudden increase in traffic is detected in the monitoring information M1.
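
  The kind of check performed on M1 can be illustrated with a small sketch. The statistic used here (mean plus k standard deviations over the samples collected so far) is an assumption chosen for illustration; this passage does not specify the statistical processing. The function name, the parameters `k` and `min_sigma`, and the sample values are hypothetical.

```python
from statistics import mean, pstdev
from typing import Sequence


def is_sudden_increase(history: Sequence[float], latest: float,
                       k: float = 3.0, min_sigma: float = 1.0) -> bool:
    """Compare the latest sample with statistics of the samples so far.

    `history` holds past traffic rates in Mbit/s. `min_sigma` keeps the
    threshold from collapsing when the history is almost constant.
    """
    if len(history) < 2:
        return False  # not enough data to judge
    mu = mean(history)
    sigma = max(pstdev(history), min_sigma)
    return latest > mu + k * sigma


# Steady 20 Mbit/s stream, then a second 60 Mbit/s stream is added
# (assuming both leave through the same monitored interface).
steady = [20.1, 19.8, 20.0, 20.2, 19.9]
print(is_sudden_increase(steady, 20.3))  # ordinary fluctuation
print(is_sudden_increase(steady, 80.0))  # jump after the second stream starts
```

  With these sample values the first call stays below the threshold and the second exceeds it, which corresponds to M1 becoming abnormal.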

  Since the monitoring information M1 has become abnormal, the network monitoring apparatus 100 moves on to the next monitoring information, packet drops. It acquires the interface MIB information ifOutDiscards (M11) and the router MIB information ipOutDiscards (M12), which correspond to the child monitoring tree numbers 1.1 and 1.2 in FIG. 15, and monitors them using the integer type information determination function.

  If packet drops exceeding the abnormality threshold are detected in either piece of monitoring information, the network monitoring apparatus 100 next starts monitoring the MIB information ifInOctets (M111, M112) of each interface, using the integer type information determination function, in order to check the amount of input traffic from the routers R2 and R3.

  Since the traffic from the streaming server H4 is 60 Mbit/s, an abnormality is detected because it exceeds the threshold of 50 Mbit/s set in advance in the monitoring information M112 of the rule DB. From this, the network administrator can see that the main cause of the failure is the traffic entering the interface IF: 192.168.31.2/24.
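
  As a worked illustration of this threshold comparison, the sketch below converts two successive readings of an octet counter into a rate in Mbit/s and compares it with the 50 Mbit/s threshold. The 10-second polling interval, the function names, and the handling of 32-bit counter wraparound are assumptions added for the example and are not taken from the patent.

```python
COUNTER32_MODULUS = 2 ** 32  # SNMP Counter32 values wrap at 2^32


def rate_mbit_per_s(prev_octets: int, curr_octets: int,
                    interval_s: float) -> float:
    """Convert two octet-counter readings into an average rate in Mbit/s."""
    # Modular subtraction gives the right delta across a single wrap.
    delta = (curr_octets - prev_octets) % COUNTER32_MODULUS
    return delta * 8 / interval_s / 1_000_000


def exceeds_threshold(rate: float, threshold: float = 50.0) -> bool:
    """Integer-type style check: abnormal when the rate is above the threshold."""
    return rate > threshold


# 60 Mbit/s for 10 s corresponds to 75,000,000 octets.
rate = rate_mbit_per_s(0, 75_000_000, 10.0)
print(rate, exceeds_threshold(rate))
```

  A 60 Mbit/s stream sustained over the polling interval therefore lies above the 50 Mbit/s threshold, while the 20 Mbit/s stream alone would not.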

  In FIG. 15, there are entries where the IF ID is the same as the Node ID. This means that the router itself is checked rather than an interface. The same applies to the subsequent figures up to FIG. 17.
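
  The drill-down behavior of the first embodiment can be sketched as a small tree structure. The class `Rule`, the function names, and the string labels for the determination functions are hypothetical. Attaching M111 and M112 beneath M11 is an assumption inferred only from the numbering, since FIG. 15 itself is not reproduced here, and the MIB object names use the standard MIB-II spellings.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Rule:
    name: str        # label of the monitoring information, e.g. "M1"
    mib_object: str  # MIB object to collect
    judge: str       # determination function: "time_series" or "integer"
    children: List["Rule"] = field(default_factory=list)


def build_first_example() -> Rule:
    """Build a tree resembling the first embodiment (structure assumed)."""
    m111 = Rule("M111", "ifInOctets", "integer")
    m112 = Rule("M112", "ifInOctets", "integer")
    m11 = Rule("M11", "ifOutDiscards", "integer", [m111, m112])
    m12 = Rule("M12", "ipOutDiscards", "integer")
    return Rule("M1", "ifOutOctets", "time_series", [m11, m12])


def rules_to_collect(root: Rule, abnormal: Set[str]) -> List[str]:
    """Always collect the root; descend only below abnormal rules."""
    selected = [root.name]
    if root.name in abnormal:
        for child in root.children:
            selected.extend(rules_to_collect(child, abnormal))
    return selected


tree = build_first_example()
print(rules_to_collect(tree, set()))          # only the initial information
print(rules_to_collect(tree, {"M1"}))         # traffic surge detected
print(rules_to_collect(tree, {"M1", "M11"}))  # packet drops detected as well
```

  Because child rules are polled only while their parent is abnormal, the amount of monitoring information collected in the normal state stays limited to the initial monitoring information.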

  Next, a second embodiment of the present invention will be described with reference to the drawings. FIG. 16 is a diagram illustrating the tree that describes how the pieces of monitoring information in the monitoring rule DB are connected in the network monitoring apparatus 100 according to the second embodiment. As shown in FIG. 16, the procedure in the second embodiment is to detect an increasing tendency in packets discarded due to errors (sign detection), then to inspect the routing table of each router (failure discovery), and, if a route failure has occurred, to report the failed route and to check whether the cause of the route failure is route discarding by the routing protocol (cause identification).

  As the initial monitoring information, the network monitoring apparatus 100 acquires, from each router, the MIB information icmpOutTimeExcds (M4, M5, M6), which indicates the number of packets discarded because the TTL (Time To Live) value reached "0", and the MIB information icmpOutDestUnreachs (M7, M8, M9), which indicates the number of packets discarded because there was no route. It monitors this information using the time-series information determination function.

  The TTL value is information added to the header of each transmitted IP packet. Each time the packet passes through a router, the TTL value is decremented by "1". When the value reaches "0", the router handling the packet at that point discards it.
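
  The effect of the TTL rule on a packet caught in a routing loop can be illustrated with a simplified model. In this sketch every router decrements the TTL before forwarding and discards the packet when the value reaches 0. The function name and the `next_hop` mapping (router name to next router, or `None` when the destination is directly attached) are hypothetical.

```python
from typing import Dict, Optional, Tuple


def trace_packet(next_hop: Dict[str, Optional[str]], first_router: str,
                 ttl: int) -> Tuple[str, str]:
    """Follow a packet hop by hop and report where it ends up.

    Returns ("delivered", router) or ("discarded", router).
    """
    router = first_router
    while True:
        ttl -= 1  # each router on the path decrements the TTL by 1
        if ttl == 0:
            return ("discarded", router)  # TTL expired at this router
        nxt = next_hop[router]
        if nxt is None:
            return ("delivered", router)  # destination is directly attached
        router = nxt


# R1 and R2 point at each other, so the packet bounces until the TTL expires.
print(trace_packet({"R1": "R2", "R2": "R1"}, "R1", ttl=4))
# Healthy path: R1 forwards to R2, which delivers the packet.
print(trace_packet({"R1": "R2", "R2": None}, "R1", ttl=4))
```

  In the looping case every packet is eventually discarded at one of the two routers, which is why a loop shows up as a rapid rise in the TTL-expiry counters on those routers.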

  Here, it is assumed that, in an environment where multiple routing protocols such as OSPF (Open Shortest Path First) and RIP (Routing Information Protocol) are running on each router, different routers adopted routes from different routing protocols when building their routing tables, and that a loop has consequently occurred between the routers R1 and R2. At this time, the network monitoring apparatus 100 detects an abrupt increase in the number of discarded packets in the monitoring information M4 and the monitoring information M5. Since the monitoring information M4 and the monitoring information M5 have become abnormal, the network monitoring apparatus 100 performs the route inspection, which is the next monitoring information: it acquires the MIB information ipRouteEntry (M41), which is the route information of a router, from all the routers and inspects the routes using the array type information determination function.

  If a loop is found in this inspection and its position is identified, the administrator can take appropriate action by looking at the position information of the loop. Next, in order to determine whether the cause of the loop is route discarding, the network monitoring apparatus 100 inspects the MIB information ipRoutingDiscards (M411, M412, M413), which indicates the number of routes discarded by each router, using the integer type information determination function. Here, since the cause of the loop is the adoption of routes from different protocols, the monitoring information M411 and the monitoring information M412 do not become abnormal.
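
  One way the route inspection could locate a loop is sketched below. This is an illustrative assumption rather than the array type information determination function itself: the routing tables are reduced to a mapping from router name to destination to next-hop router name, whereas real ipRouteEntry rows carry IP addresses. The names `RouteTables` and `find_loop` are hypothetical.

```python
from typing import Dict, List, Optional

# router -> destination -> next-hop router (None = directly connected)
RouteTables = Dict[str, Dict[str, Optional[str]]]


def find_loop(tables: RouteTables, start: str,
              destination: str) -> Optional[List[str]]:
    """Follow next hops toward `destination`; return the loop if one exists.

    Returns the list of routers forming the loop, or None when the path
    ends (directly connected, or no route entry) without revisiting a router.
    """
    path: List[str] = []
    router: Optional[str] = start
    while router is not None:
        if router in path:
            return path[path.index(router):]  # routers from first repeat on
        path.append(router)
        router = tables.get(router, {}).get(destination)
    return None


looping = {"R1": {"L3": "R2"}, "R2": {"L3": "R1"}, "R3": {"L3": None}}
healthy = {"R1": {"L3": "R2"}, "R2": {"L3": "R3"}, "R3": {"L3": None}}
print(find_loop(looping, "R1", "L3"))
print(find_loop(healthy, "R1", "L3"))
```

  The returned list corresponds to the position information of the loop that is reported to the administrator.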

  If, instead, a route is deleted from the routing table because of an abnormality in the routing protocol, one of the monitoring information M7, the monitoring information M8, and the monitoring information M9 becomes abnormal. After the route inspection in the monitoring information M41 detects that the route is missing, one of the monitoring information M411, the monitoring information M412, and the monitoring information M413 becomes abnormal, so the administrator can quickly find out which router has an abnormality in its routing protocol.

  Next, a third embodiment of the present invention will be described with reference to the drawings. FIG. 17 is a diagram illustrating the tree that describes how the pieces of monitoring information in the monitoring rule DB are connected in the network monitoring apparatus 100 according to the third embodiment. As shown in FIG. 17, the procedure in the third embodiment is to detect an increasing tendency in the number of normal packets discarded (sign detection), to monitor whether a CPU overload failure or a temperature failure that leads to packet discarding has occurred (failure discovery), and then to check whether a process is running out of control if a CPU overload has occurred, or to check the state of the fans if the temperature is abnormal (cause identification).

  As the initial monitoring information, the network monitoring apparatus 100 acquires the MIB information ifOutDiscards (MM10, MM12, MM14), which indicates the number of normal packets discarded at each interface, and the MIB information ipOutDiscards (MM11, MM13, MM15), which indicates the number of normal packets discarded at each router, and monitors them using the time-series information determination function.

  Here, it is assumed that the CPU of the router R1 becomes overloaded because a protocol running on the router has run out of control. Because of the overload, the routing protocol does not operate correctly, and packets for routes that are not in the current routing table are discarded at R1. At this time, the network monitoring apparatus 100 detects that the number of normal packets discarded gradually increases in the monitoring information MM10 or the monitoring information MM11.

  Since the monitoring information MM10 or the monitoring information MM11 has become abnormal, the network monitoring apparatus 100 monitors CPU overload and temperature abnormality, which are the next monitoring information. It acquires the MIB information cpmCPUTotal5sec (MM101), which indicates the CPU usage rate of the router, and the MIB information ciscoEnvMonTemperatureStatusValue (MM111), which indicates the temperature state, and inspects each of them using the integer type information determination function.

  In this inspection, if the CPU usage rate is larger than the threshold value of the monitoring information MM101, the network monitoring apparatus 100 considers that a failure has occurred. To find out which process is the cause, it then acquires the MIB information cpmProcessAverageUSecs (MM1011), which indicates the CPU occupancy of each process, and inspects it using the integer type information determination function. If the value is larger than the threshold value of the monitoring information MM1011, the network monitoring apparatus 100 considers the process abnormal and notifies the administrator of its process ID.

  Thereby, the administrator can quickly find out which process is abnormal. Even if it is assumed that a temperature failure has occurred, it is possible to promptly notify the administrator which fan has the cause by the same operation as described above.
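
  The two-stage CPU check can be sketched as follows. The function name, the parameter names, and the sample values are hypothetical, and the units of the per-process figure are left abstract. The sketch only illustrates the order of the checks: the overall CPU usage first, then the per-process values.

```python
from typing import Dict, List


def diagnose_cpu(total_cpu_percent: int, cpu_threshold: int,
                 per_process: Dict[int, int],
                 process_threshold: int) -> List[int]:
    """Return the IDs of processes to report to the administrator.

    The per-process values are examined only when the overall CPU usage
    exceeds its threshold, mirroring the drill-down order of the tree.
    """
    if total_cpu_percent <= cpu_threshold:
        return []  # no CPU overload, so nothing further is collected
    return sorted(pid for pid, usage in per_process.items()
                  if usage > process_threshold)


# Hypothetical readings: process 42 dominates the CPU.
usage_by_pid = {7: 120, 42: 9500, 88: 300}
print(diagnose_cpu(97, 80, usage_by_pid, 5000))
print(diagnose_cpu(35, 80, usage_by_pid, 5000))
```

  The same pattern would apply to the temperature branch, with the temperature reading checked first and the fan state examined only when it is abnormal.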

  The operation flows shown in the embodiment and examples described above can, of course, be implemented by recording the operation procedure in advance as a program on a recording medium such as a ROM and having a computer (CPU) read and execute it.

It is a block diagram showing the structure of the network management system in the embodiment of the present invention and the structure of the monitored network.
It is a diagram showing an example of the monitoring rules in the monitoring rule DB used by the network management system in the embodiment of the present invention.
It is a block diagram showing the structure of the time-series information determination function, which is a determination function in the embodiment of the present invention.
It is a diagram showing the operation flow of FIG.
It is a block diagram showing the structure of the multiple time-series information determination function, which is a determination function in the embodiment of the present invention.
It is a diagram showing the operation flow of FIG.
It is a block diagram showing the structure of the integer type information determination function, which is a determination function in the embodiment of the present invention.
It is a diagram showing the operation flow of FIG.
It is a block diagram showing the structure of the array type information determination function, which is a determination function in the embodiment of the present invention.
It is a diagram showing the operation flow of FIG.
It is a diagram showing the processing flow of the array type determination function in the embodiment of the present invention.
It is a diagram showing the characteristics of each determination function in the embodiment of the present invention.
It is a flowchart showing the operation flow of the network management system in the embodiment of the present invention.
It is a block diagram showing the example network configuration used to describe the examples of the present invention.
It is a diagram showing the monitoring rules in the first example of the present invention.
It is a diagram showing the monitoring rules in the second example of the present invention.
It is a diagram showing the monitoring rules in the third example of the present invention.

Explanation of symbols

100 Network monitoring system
101 Monitoring information collection unit
102 Monitoring information determination unit
103 Determination function unit
104 Collected monitoring information determination unit
105 Monitoring rule DB (database)
106 Log storage unit
120 Monitoring site
121 Monitoring terminal

Claims (17)

  1. A network monitoring system for collecting and monitoring information on a plurality of network devices, comprising:
    monitoring rule storage means for storing in advance, as monitoring rules, initial monitoring information to be collected from each of the network devices and monitoring information related thereto;
    sign detection means for detecting a sign of failure by processing the initial monitoring information collected from the network devices;
    collected monitoring information determination means for, in response to detection of the sign by the sign detection means, searching the monitoring rule storage means for monitoring information that is related to the initial monitoring information and identifies the cause of the failure, and collecting the retrieved monitoring information; and
    post-discovery means for performing failure detail determination processing based on the monitoring information collected by the collected monitoring information determination means.
  2. The network monitoring system according to claim 1, wherein the initial monitoring information is time-series monitoring information that changes over time, and
    the sign detection means comprises means for statistically processing the time-series monitoring information collected so far, and means for detecting a sign of failure by comparing the result of the statistical processing with the most recently collected information.
  3. The network monitoring system according to claim 1, wherein the initial monitoring information is time-series monitoring information that changes over time, and
    the sign detection means comprises means for statistically processing the correlation among a plurality of pieces of time-series monitoring information collected so far, and means for detecting a sign of failure by comparing the result of the statistical processing with the most recently collected information.
  4.   The network monitoring system according to any one of claims 1 to 3, wherein the related monitoring information is route information held by the network device, and the post-discovery means checks the normality of the route by inspecting the route information.
  5.   The network monitoring system according to any one of claims 1 to 3, wherein the related monitoring information is integer type information (integer type monitoring information), and the post-discovery means evaluates the integer value.
  6.   The network monitoring system according to any one of claims 1 to 5, wherein the monitoring information related to the initial monitoring information stored in the storage means forms a tree structure of sequentially related monitoring information, the collected monitoring information determination means determines which monitoring information to collect by sequentially searching the tree structure for more detailed related monitoring information, and the post-discovery means performs failure detail determination processing based on the collected monitoring information.
  7.   The network monitoring system according to claim 1, wherein the monitoring information is collected using SNMP (Simple Network Management Protocol), and the post-discovery means selects the determination processing function, at the time of the determination processing, based on the data format defined in the MIB (Management Information Base).
  8.   The network monitoring system according to claim 1, further comprising means for, when the determination result for monitoring information whose collection was instructed by the collected monitoring information determination means is normal, terminating collection of that monitoring information and releasing the abnormal state of the monitoring information that triggered its monitoring.
  9. A network monitoring method for collecting and monitoring information on a plurality of network devices, comprising:
    providing monitoring rule storage means that stores in advance, as monitoring rules, initial monitoring information to be collected from each of the network devices and monitoring information related thereto;
    a sign detection step of detecting a sign of failure by processing the initial monitoring information collected from the network devices;
    a collected monitoring information determination step of, in response to detection of the sign in the sign detection step, searching the monitoring rule storage means for monitoring information that is related to the initial monitoring information and identifies the cause of the failure, and collecting the retrieved monitoring information; and
    a post-discovery step of performing failure detail determination processing based on the monitoring information collected in the collected monitoring information determination step.
  10. The network monitoring method according to claim 9, wherein the initial monitoring information is time-series monitoring information that changes over time, and
    the sign detection step comprises a step of statistically processing the time-series monitoring information collected so far, and a step of detecting a sign of failure by comparing the result of the statistical processing with the most recently collected information.
  11. The network monitoring method according to claim 9, wherein the initial monitoring information is time-series monitoring information that changes over time, and
    the sign detection step comprises a step of statistically processing the correlation among a plurality of pieces of time-series monitoring information collected so far, and a step of detecting a sign of failure by comparing the result of the statistical processing with the most recently collected information.
  12.   The network monitoring method according to any one of claims 9 to 11, wherein the related monitoring information is route information held by the network device, and the post-discovery step checks the normality of the route by inspecting the route information.
  13.   The network monitoring method according to claim 9, wherein the related monitoring information is integer type information (integer type monitoring information), and the post-discovery step evaluates the integer value.
  14.   The network monitoring method according to any one of claims 9 to 13, wherein the monitoring information related to the initial monitoring information stored in the storage means forms a tree structure of sequentially related monitoring information, the collected monitoring information determination step determines which monitoring information to collect by sequentially searching the tree structure for more detailed related monitoring information, and the post-discovery step performs failure detail determination processing based on the collected monitoring information.
  15.   The network monitoring method according to claim 9, wherein the monitoring information is collected using SNMP (Simple Network Management Protocol), and the post-discovery step selects the determination processing function, at the time of the determination processing, based on the data format defined in the MIB (Management Information Base).
  16.   The network monitoring method according to claim 9, further comprising a step of, when the determination result for monitoring information whose collection was instructed in the collected monitoring information determination step is normal, terminating collection of that monitoring information and releasing the abnormal state of the monitoring information that triggered its monitoring.
  17. A computer-readable program for causing a computer to execute a monitoring method for collecting and monitoring information on a plurality of network devices, the program comprising:
    a process of detecting a sign of failure by processing initial monitoring information collected from the network devices;
    a process of, in response to detection of the sign, searching monitoring rule storage means for monitoring information that is related to the initial monitoring information and identifies the cause of the failure, and collecting the retrieved monitoring information; and
    a post-discovery process of determining failure details based on the related monitoring information.
JP2004101827A 2004-03-31 2004-03-31 Network monitoring system and method, and program Active JP4412031B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2004101827A JP4412031B2 (en) 2004-03-31 2004-03-31 Network monitoring system and method, and program

Publications (2)

Publication Number Publication Date
JP2005285040A JP2005285040A (en) 2005-10-13
JP4412031B2 true JP4412031B2 (en) 2010-02-10

Family

ID=35183309

Country Status (1)

Country Link
JP (1) JP4412031B2 (en)

Legal Events

Date Code Title Description
A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20070115

A977 Report on retrieval

Free format text: JAPANESE INTERMEDIATE CODE: A971007

Effective date: 20090529

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20090609

A521 Written amendment

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20090807

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20091027

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20091109

R150 Certificate of patent or registration of utility model

Ref document number: 4412031

Country of ref document: JP

Free format text: JAPANESE INTERMEDIATE CODE: R150

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20121127

Year of fee payment: 3

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20131127

Year of fee payment: 4