WO2001050266A1 - System and method for topology based monitoring of networking devices

System and method for topology based monitoring of networking devices

Info

Publication number
WO2001050266A1
WO2001050266A1 (PCT/US2000/035711)
Authority
WO
WIPO (PCT)
Application number
PCT/US2000/035711
Other languages
French (fr)
Inventor
William Gaske
Original Assignee
Computer Associates Think, Inc.
Priority date
Filing date
Publication date
Application filed by Computer Associates Think, Inc. filed Critical Computer Associates Think, Inc.
Priority to AU26117/01A priority Critical patent/AU2611701A/en
Publication of WO2001050266A1 publication Critical patent/WO2001050266A1/en

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06Management of faults, events, alarms or notifications
    • H04L41/0631Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H04L41/065Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis involving logical or physical relationship, e.g. grouping and hierarchies
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/12Discovery or management of network topologies
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00Arrangements for monitoring or testing data switching networks
    • H04L43/08Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/0805Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability
    • H04L43/0817Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability by checking functioning
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L69/00Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L69/40Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass for recovering from a failure of a protocol instance or entity, e.g. service redundancy protocols, protocol state redundancy or protocol service redirection

Abstract

A method of monitoring a network (1) including a plurality of components comprises providing a monitoring request to a component for monitoring the network for an apparent failure of the component on the network, and immediately monitoring availability of a chain of components between a failed component and a monitoring system to establish which component in the chain is causing the apparent failure in subsequent components.

Description

SYSTEM AND METHOD FOR TOPOLOGY BASED MONITORING OF
NETWORKING DEVICES
Reference to Related Application
The present application claims the benefit of provisional Application Serial No.
60/173,816, filed December 30, 1999, which is hereby incorporated herein by reference.
BACKGROUND
1. Field of the Disclosure
The present disclosure relates generally to network monitoring, and in particular, to a
system and method for topology based monitoring of networking devices.
2. Description of The Related Art
In a networked environment, devices are typically connected on various segments of a
network. The various segments of the network may in turn be connected to one another using
some level of connectivity hardware, such as a hub, bridge or router. When a monitoring system
is inserted into the network to monitor the status of the devices connected to the network, a
failure of one of these connectivity devices can cause multiple device failure messages for all
devices on the other side of the connectivity hardware, rather than specifically identifying the
connectivity device that failed. In addition, a failure of a device connected between the
monitoring system and devices on the other side of the failed device will also cause multiple
device failure messages, rather than specifically identifying the device that failed.
An example of a typical network arrangement is shown in Fig. 6. The representation shown in Fig. 6 depicts a network tree 201 which shows the interconnectivity of the devices on
the network. The network includes a monitoring device 200, monitored devices 202, 204 and
router 206 connected to a hub 208. Router 206 is also connected to hub 210, along with
monitored devices 212-216. Monitoring device 200 is capable of actively monitoring and
actively verifying the availability of the devices on the network, using any one of a plurality of
known network diagnostic or network application requests.
In the network arrangement shown in Fig. 6, if a catastrophic failure such as a power
outage, for example, should occur in hub 208, which renders it unable to transport network
requests from monitoring device 200 to monitored devices 202, 204 and router 206, for example,
all requests to these devices will fail. In addition, all requests to all devices on the other side of
router 206 (e.g., hub 210, and monitored devices 212-216) will also fail. Monitoring device 200
will see this as a lack of availability of all devices on the network, and report all devices as
having failed. Ideally, however, monitoring device 200 would be able to quickly identify and
report hub 208 as a failed device and report the other devices on the network as unavailable. In
another example, if router 206 were to encounter a catastrophic failure instead of hub 208, all
requests to router 206, hub 210 and monitored devices 212-216 would fail. Ideally monitoring
device 200 would be able to quickly identify and report router 206 as having failed and identify
and report hub 210 and any devices connected thereto (monitored devices 212-216) as
unavailable.
One possible method of identifying the failed device on the network may include
performing an analysis using an event correlation type system which attempts to identify the
specific failed component within the network after receiving and correlating failure messages from multiple networked devices. However, it is often difficult to effectively and efficiently
achieve this result, requiring a considerable amount of time and processing.
SUMMARY
The present disclosure relates to a method and system for monitoring a network
including a plurality of components. The method and system comprises providing a monitoring
request to a component for monitoring the network for an apparent failure of the component on
the network, and immediately monitoring availability of a chain of components between a failed
component and a monitoring system to establish which component in the chain is causing the
apparent failure in subsequent components.
The method and system may further comprise classifying the component in the chain
causing the apparent failure in subsequent components as failed and components reachable
through the failed component as status unknown. The method and system may further comprise
determining a topology of the network, the topology of the network being determined
automatically using a network discovery mechanism, or manually specified by a user.
The method and system may comprise a network monitoring application running on a
network monitoring node. The network monitoring application may represent the network as a
tree with the network monitoring node at a top of the tree.
BRIEF DESCRIPTION OF THE DRAWINGS
The above and other features of the disclosure will be apparent upon consideration of the following detailed description, when read in conjunction with the accompanying drawings, in
which like reference characters refer to like parts throughout and in which:
Figure 1 shows an example of a network used for explaining embodiments of the present
disclosure;
Figure 2 is a block diagram of a monitoring system according to an embodiment;
Figure 3 is a flow chart representing the steps carried out during monitoring according to
an embodiment;
Figures 4A and 4B show a tree structure and diagrammatic illustration, respectively, of
the network configuration shown in Fig. 1;
Figures 5A - 5E show events in various event queues at specific points in time for
explaining an example of the present system and method;
Figure 6 depicts a network configuration for explaining network monitoring; and
Figure 7 is a block diagram depicting exemplary components capable of being monitored
in a monitored device.
DETAILED DESCRIPTION
In describing the preferred embodiments illustrated in the drawings, specific
terminology is employed for sake of clarity. However, the disclosed embodiments are not
intended to be limited to the specific terminology so selected and it is to be understood that
each specific element includes all technical equivalents which operate in a similar manner.
Fig. 1 depicts an exemplary network 1 to which the present system and method may be
applied. For example, in this embodiment, a monitoring system 20 may be connected to devices such as a network printer 10 and a computer workstation 12, via hub 26. Workstation 14, server
16 and network printer 18 are connected to each other via hub 24. Devices on hub 24 and
devices on hub 26 are capable of communicating via router 30. The devices on network 1 are
described above as being "connected". This refers to the ability of the devices to communicate
or otherwise interface to the network via links 4 and not necessarily to a physical connection. Of
course, other types of devices may be provided on the network, including, for example, network
facsimile devices, other servers, workstations, printers, hubs, routers, etc.
Applications 22 running on monitoring system 20 can communicate with these other
devices on hub 26 and hub 24 via suitable interfaces and protocols based on the operating
systems on the network and the network architecture used. For example, an exemplary network
to which the present system and method can be applied might be an un-bridged local area
network, such as an ethernet or token ring network. The network could be a bridged ethernet or
token ring network, or a combination of token ring and ethernet networks connected by one or
more routers or switches. The network could also be a Wide Area Network (WAN) including the
Internet or an intranet system, for example, using the TCP/IP protocol to communicate and
connect with other local area networks. An embodiment of a monitoring system according to the
present disclosure is capable of operating on a network 1 as shown in Figure 1 in which two
ethernet LANs (LAN 1 and LAN 2) in a star configuration are linked together by router 30.
The components communicating over the network use low level protocols dependent on
the transport medium used and higher level protocols dependent on the operating systems and
applications executing on the linked network components. A discussion regarding details of the
communication between the network components is not necessary for an understanding of the present disclosure. It is assumed that the monitoring application 22 or another application
executing on the monitoring device 20 is provided with the functionality to communicate with
the other specific network components on network 1 and to thereby monitor the components and
establish whether they are operating appropriately. Of course, the monitoring application 22 can
be provided on one of the other devices shown in Fig. 1 or on a device remote to hubs 24, 26 and
used to monitor the hubs themselves and one or more of the other devices on the hubs. A
monitoring device may itself be monitored by another device running a monitoring application.
It is assumed hereinafter that suitable protocols and network operating systems are in
place and that the interfaces, files, ports and other components of the remote devices can be
accessed through application program interfaces (APIs) or appropriate remote procedure calls
(RPCs). Such arrangements are very well known and utilized on most networked devices today.
Monitoring system 20 may be a standard PC, laptop, mainframe, etc. capable of running a
monitoring application according to an embodiment described herein. Fig. 2 depicts a block
diagram of exemplary elements that monitoring system 20 may include. Of course, monitoring
system 20 may not include each element shown and/or may include additional elements not
shown. As shown, monitoring system 20 may include a central processing unit (CPU) 62, a
memory 64, a clock circuit 66, a printer interface 68, a display unit 70, a LAN data transmission
controller 74, a LAN interface 76, a network controller 78, an internal bus 80 and one or more
input devices 72 such as, for example, a keyboard and mouse.
CPU 62 controls the operation of system 20 and is capable of running applications stored
in memory 64. Memory 64 may include, for example, RAM, ROM, removable CDROM, DVD,
etc. Memory 64 may also store various types of data necessary for the execution of the applications, as well as a work area reserved for use by CPU 62. Clock circuit 66 may include a
circuit, for example, for generating information indicating the present time.
The LAN interface 76 allows communication between the ethernet LAN 1 and the LAN
data transmission controller 74. The LAN data transmission controller 74 uses a predetermined
protocol suite to exchange information and data with the other devices on the network 1.
Monitoring system 20 may also be capable of communicating with and/or monitoring the devices
on LAN 2 via router 30 and/or on other networks. System 20 may also be capable of
communicating with other devices via a Public Switched Telephone Network (PSTN) using
network controller 78. Internal bus 80, which may actually consist of a plurality of buses, allows
communication between each of the components connected thereto.
Each of the devices on network 1 may be capable of being monitored. Accordingly, as
depicted in Fig. 1, network printer 10, workstation 12, workstation 14, server 16 and network
printer 18 are identified as monitored devices 1-5, respectively. Although not
specifically identified as monitored devices, hubs 24, 26 and router 30 are also capable of being
monitored. Of course, additional devices, whether capable of being monitored or not, may be
provided on network 1.
Each of the devices on network 1 may include one or more of the components shown in
Fig. 7, each of which is capable of being monitored by system 20. The definition of what
constitutes a component as used in this description is a matter of choice depending on the
implementation. However, the following are examples of the types of components that may be
monitored and the present embodiments envision the monitoring of any component capable of
being monitored. As shown in Fig. 7, a device may include one or more system event logs 40, one or more system error log files 42, one or more application log files 44, TCP/IP port
(connectivity) information 46, services/processes running status information 48, physical
performance data 50 and/or physical network interface (availability) information 52.
The monitoring application 22 run by monitoring system 20 is capable of monitoring one
or more of the components shown in Fig. 7 for each of the devices on network 1. The
monitoring application 22 maintains a set of queues, each queue corresponding to one of the
devices being monitored. The queue structures will be described in more detail later below.
Briefly, however, each queue includes one or more monitoring events, each monitoring event in
each queue containing an instruction for monitoring a specific component (see Fig. 7) on the
associated monitored device. Further, the queues could alternatively be combined into a single
queue identifying the component associated with each event. Once a monitoring event in the
queue is triggered, an independently executing task or thread is generated which contains
instructions to wait for a response to the monitoring event and then process the results in a
manner consistent with the results. If an error message or response is received from the device or
if no response is received within a set amount of time, the monitoring event for that device is
deemed failed.
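For purposes of illustration only, the per-device queues and monitoring events described above might be sketched as follows. This is a rough Python sketch and not the disclosed implementation; the MonitoringEvent and DeviceQueue names, the probe callable and the five-second timeout are assumptions made for the example.

    import time
    from dataclasses import dataclass, field
    from typing import Callable, List

    @dataclass
    class MonitoringEvent:
        """One instruction to monitor a specific component (see Fig. 7) of a device."""
        device: str          # e.g. "monitored device 3"
        component: str       # e.g. "TCP/IP port connectivity"
        scheduled_at: float = field(default_factory=time.time)

    @dataclass
    class DeviceQueue:
        """Queue of monitoring events maintained for one monitored device."""
        device: str
        events: List[MonitoringEvent] = field(default_factory=list)

    def run_event(event: MonitoringEvent, probe: Callable[..., bool], timeout_s: float = 5.0) -> bool:
        """Trigger an event; an error response or no response within the timeout means the event failed."""
        try:
            return probe(event.device, event.component, timeout=timeout_s)
        except Exception:
            return False

In the disclosure, each triggered event spawns an independently executing task or thread that waits for the response; the synchronous call above merely stands in for that behavior.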
Monitoring device 20 has access to information showing the structure or configuration of
the network. This information may be in the form of, for example, a tree structure or network
topology. The network configuration information identifies the devices on the network and how
the devices are arranged on the network. The network configuration information may be
maintained in memory 64 of monitoring device 20, or otherwise accessible by the monitoring
application 22. The network configuration information may include a set of structures or objects each representing a device on the network, and a linked list of pointers to other devices. The
pointers may include pointers to parent devices. The network configuration information may, for
purposes of illustration, be represented as a tree structure, as shown in Fig. 4A. In the
alternative, the network configuration information may, for purposes of illustration, be
represented diagrammatically as shown in Fig. 4B. As shown in Fig. 4B, for each device, its
parent or dependent device exists above and at one level less of indentation. In this example,
monitoring device 20 and hub 26 do not have any parent or dependent device. The network
configuration information may be obtained automatically using a network discovery system or
may be manually specified by a user.
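As a further illustration, the configuration structure just described (one object per device, with a pointer to its parent) might be represented as in the following sketch, which hard-codes the topology of Figs. 4A and 4B; a network discovery mechanism could populate the same structure automatically. The Device class and its field names are assumptions for the example only.

    from typing import Dict, List, Optional

    class Device:
        """One node in the network configuration information, linked to its parent device."""
        def __init__(self, name: str, parent: Optional["Device"] = None):
            self.name = name
            self.parent = parent                      # device one level higher in the tree, or None
            self.children: List["Device"] = []
            if parent is not None:
                parent.children.append(self)

    def build_fig_4a_topology() -> Dict[str, Device]:
        """Manually specified topology corresponding to Figs. 4A/4B."""
        d: Dict[str, Device] = {}
        d["hub 26"] = Device("hub 26")                # no parent (top of the tree, with monitoring system 20)
        d["monitored device 1"] = Device("monitored device 1", d["hub 26"])
        d["monitored device 2"] = Device("monitored device 2", d["hub 26"])
        d["router 30"] = Device("router 30", d["hub 26"])
        d["hub 24"] = Device("hub 24", d["router 30"])
        for name in ("monitored device 3", "monitored device 4", "monitored device 5"):
            d[name] = Device(name, d["hub 24"])
        return d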
As shown in Fig. 4A, the network configuration information represents the route taken
from device to device by data transmitted to and from the monitoring device. Stated another
way, a first device can be said to be linked to a second device one level higher in the tree if: a
failure of the second device would prevent communication between the monitoring device and
the first device; and there are no further monitored devices between the first and second devices
which are instrumental in communication of data between the first device and the monitoring
device.
For example, with the network configuration shown in Figure 4A, a failure in monitored
device 1, monitored device 2 or router 30 will not affect any other device directly connected to
hub 26. However, a failure of hub 26 itself may cause the entire section of the network (e.g., all
devices connected to hub 26) to cease operating. Thus, monitored devices 1, 2 and router 30 can
be said to be independent, but are dependent on hub 26 to communicate with each other and the
monitoring system 20. The network configuration information can be automatically established by monitoring
system 20 by suitably polling the devices on network 1 in a known manner. The network
configuration information need not be part of a larger data structure, such as a detailed
representation of the network configuration, as long as devices instrumental in transmitting data
between nodes can be ascertained from the structure.
Monitoring application 22 is configured to periodically monitor each device on network 1
at scheduled intervals. Following successful monitoring of a device, the next monitoring time
for that device is rescheduled based on a set monitoring interval. An example of a monitoring
schedule table is shown in Figure 5A. Figure 5A shows all device queues merged into a single
table for clarity in discussion. Of course, each queue for each device may actually be arranged
separately. The schedule may be laid out as a table including a "Queue For" column 31, an
"Event" column 32 and a "Monitoring Scheduled" column 33. Column 31 identifies the device
queue. Column 32 describes the event taking place for the device identified in column 31,
indicating, for example, either that the device is scheduled to be monitored or identifying a
device that failed during a previous monitoring event. Column 33 indicates the time that the
event occurred.
Fig. 3 is a flowchart for describing the processing of a monitoring event according to an
embodiment. Initially, in Step S2, monitoring of the first device in the queue is started. A
determination whether the monitoring event for the selected device was successful is then made
(Step S4). For example, a monitoring event may include monitoring TCP/IP port connectivity or
services/processes running information 48 (Fig. 7) of monitored device 3 (workstation 14). If the
monitoring event was successful (Yes, Step S4), a determination is made in Step S6 whether monitored device 3's queue includes a previous device failure indication. If Yes in Step S6, the
failure of the previous device indicated as having failed is processed (Step S7). If No in Step S6,
the next device to be monitored is selected (Step S5), monitoring is started (Step S2) and the
process returns to Step S4 for determining whether the monitoring event for the device was
successful. If the monitoring event was not successful (No, Step S4), the event queue for the
device being monitored is purged (Step S8). A determination is then made whether the device
has a parent, by reference to the network configuration information (e.g., see Figs 4A, 4B). In
Fig. 4A, a parent device may be considered a device on a level immediately "above" a device in
the tree structure and having a link thereto. In Fig. 4B, a parent device exists above and at one
level less of indentation. If the device does not have a parent (No, Step S10), the failure of the
device just monitored is processed (Step S12). If the device has a parent (Yes, Step S10), the
parent device is rescheduled for immediate monitoring (Step S14), and information identifying
the failure of the child device is added to the parent device's queue (Step S16) and monitoring is
started (Step S2). The procedure then returns to Step S4 and monitoring of the parent device is
performed in a similar manner. This process continues until a device is successfully monitored
(Yes, Step S4) and its event queue includes a child device's failure indication (Yes, Step S6) or
until it is determined that a failed device does not have a parent (No, Step S10). The reference to
scheduling a device for "immediate monitoring" refers to placing the device at the head of the
queue table to be monitored next, and not necessarily to the exact time frame for which the
monitoring event is to occur.
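The escalation loop of Fig. 3 may be paraphrased in code as follows. This is only a sketch under the same assumed Device structure as the earlier example; probe and notify are stand-in callables for the actual monitoring request and failure processing, and the queue purge of Step S8 is reduced to a comment.

    def process_monitoring_event(device, probe, notify):
        """Escalate along the chain of parents until a reachable device or a parentless failure is found."""
        failed_child = None
        current = device
        while True:
            ok = probe(current)                       # Steps S2/S4: run the monitoring event for this device
            if ok:
                if failed_child is not None:          # Steps S6/S7: process the recorded child failure
                    notify(f"{failed_child.name} failed; {current.name} is reachable")
                return
            # Step S8: the device's event queue would be purged here.
            if current.parent is None:                # Steps S10/S12: no parent, so report this device as failed
                notify(f"{current.name} failed")
                return
            failed_child = current                    # Steps S14/S16: reschedule the parent for immediate
            current = current.parent                  # monitoring and record the child's failure

With the Fig. 4A topology, a catastrophic failure of hub 26 drives this loop from monitored device 3 up through hub 24 and router 30 to hub 26, which is then reported as the failed component, as in the example that follows.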
The processing performed in Fig. 3 will now be further explained by reference to an
example, using the network tree configuration information as shown in Figs. 4A and 4B. In this example, it is assumed that hub 26 availability has been verified, and hub 26 has been
rescheduled for monitoring at a later time. It is further assumed that this is followed by a
catastrophic failure of hub 26. Fig. 5A depicts the state of the monitoring queue at this point in
time 34.
In this example, as shown in Fig. 5A, monitored device 3 (workstation 14) is the next
device in the queue scheduled to be monitored. Monitoring of monitored device 3 begins (Step
S2). In Step S4, a determination is made whether the monitoring event for monitored device 3
was successful. If there had not been a catastrophic failure of hub 26, the monitoring event
would have been successful (Yes, Step S4) in this example, and since the event queue for
monitored device 3 would not include a child device's failure indication (No, Step S6), the next
device in the queue would have been selected to be monitored (Step S5). However, in this
example, this monitoring request fails due to the catastrophic failure of hub 26 (No, Step S4).
Accordingly, the event queue of monitored device 3 is purged (Step S8), since monitored device
3 is unreachable. A determination is then made in Step S10 whether monitored device 3 has a
parent, by referring to the network configuration as shown in Figs. 4A, 4B. If there were no
parent device (No, Step S10), the failure of monitored device 3 would be processed (Step S12).
Processing of the failure may include notification of the failure of monitored device 3 to a user
via a monitor and/or storage of information identifying the failure for future reference. However,
in this case, since monitored device 3 has a parent device (hub 24) (Yes, Step S10), hub 24 is
then rescheduled for immediate monitoring (Step S14). "Failure of MD3" is then placed into the
event queue for hub 24 (Step S16), identifying that monitored device 3 failed. At this point in
time, the monitoring event queue is as shown in Fig. 5B. As shown, hub 24 is the next device to be monitored and Hub 24's queue ir dicates that monitored device 3 failed. Monitoring of hub 24
is then started (Step S2) and the process repeats.
If the monitoring of hub 24 had succeeded (Yes, Step S4), this would indicate that hub 24
was operating properly and looking at hub 24's event queue (Step S6) indicates that monitored
device 3 had failed. The "Failure of MD3" event, placed in the queue of hub 24 would then be
processed (Step S7). However, in this example, the monitoring of hub 24 fails (No, Step S4).
Accordingly, the event queue for hub 24 is purged (Step S8), to remove the "Failure of MD3"
event from its queue. A determination is then made in Step S10 whether hub 24 has a parent
device by reference to the network configuration. If hub 24 did not have a parent device (No,
Step S10), the failure of hub 24 would be processed so that notification of the failure of hub 24
would be provided. However, in this example, since hub 24 has a parent (router 30) (Yes,
Step S10), router 30 is rescheduled for immediate monitoring (Step S14). The "Failure of hub
24" event is then placed into the event queue for router 30 (Step S16). Fig. 5C depicts the monitoring
event queue at this point in time. Monitoring of router 30 is then started (Step S2).
If the monitoring of router 30 succeeded (Yes, Step S4), this would indicate that router
30 was operating properly and by referring to router 30's event queue (Step S6) it could be
determined that hub 24 had failed. The "Failure of hub 24" event would then be processed (Step
S7). However, in this example, the monitoring of router 30 fails (No, Step S4). Accordingly, the
event queue for router 30 is purged (Step S8), to remove the "failure of hub 24" event from its
queue. A determination is then made in Step S10 whether router 30 has a parent device by
reference to the network configuration information. If router 30 did not have a parent device
(No, Step S10), the failure of router 30 would be processed so that notification of the failure of
(Yes, Step S10), hub 26 is rescheduled for immediate monitoring (Step S14). The "Failure of
router 30" is placed into the event queue for hub 26 (Step S16). Fig. 5D depicts the monitoring
event queue at this point in time. Monitoring of hub 26 is then begun (Step S2).
If the monitoring of hub 26 succeeded (Yes, Step S4), by referring to the event queue (Step
S6) it could be determined that router 30 had failed. The "Failure of Router 30" would then be
processed (Step S7). However, in this example, the monitoring of hub 26 fails (No, Step S4) and
its event queue is thus purged (Step S8), to remove the "Failure of Router 30" event from its
queue. A determination is then made in Step S10 whether hub 26 has a parent device by
reference to the network configuration. In this example, since hub 26 does not have any parent
device (No, Step S10), the failure of the current device (hub 26) is processed (Step S12).
At this point, hub 26 would be classified as a failed component, and the availability for
devices dependent on hub 26 would be classified as having unknown status. In this topology, the
devices classified as status unknown would be monitored device 1, monitored device 2,
monitored device 3, monitored device 4, monitored device 5, router 30 and hub 24. This can
easily be accomplished by suitably recursively navigating the tree. In the alternative, a list of
failed devices can be created as they are determined and then classified as unknown status as
appropriate. In addition, it may be desirable to purge all the queues for the devices downstream
of hub 26 to ready for the next monitoring event (see Fig. 5E).
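The recursive navigation of the tree mentioned above could, for illustration, look like the following sketch (again using the assumed Device structure with a children list from the earlier example):

    def classify_failed_and_dependents(failed_device, status=None):
        """Mark the failed device as failed and everything reachable only through it as status unknown."""
        if status is None:
            status = {}
        status[failed_device.name] = "failed"

        def mark_unknown(node):
            for child in node.children:
                status[child.name] = "unknown"
                mark_unknown(child)

        mark_unknown(failed_device)
        return status

Applied to hub 26 in the Fig. 4A topology, this marks hub 26 as failed and monitored devices 1-5, router 30 and hub 24 as status unknown, matching the classification described above.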
In an alternative embodiment, instead of purging all events in a queue when a device is
known to be unreachable, any "failure of dependent device" event in the queue can be deleted
from the queue and the queue can be suspended. This allows scheduled events associated with unreachable components to be maintained and possibly executed when the failed component is
functioning correctly again. For example, instead of completely purging the device's queue in
Step S8 of Fig. 3, only the "failure of dependent device" portion of the queue can be purged, thus
maintaining future scheduled monitoring events for the device in the device's queue. The
device's queue can then be suspended for a set period of time or until it is determined that the
failed component is functioning properly.
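A minimal sketch of this selective purge, assuming queues are held as simple dictionaries with an event "kind" field and a suspension timestamp (both invented for the example):

    import time

    def purge_dependent_failures_and_suspend(device_queue: dict, suspend_for_s: float = 300.0) -> dict:
        """Drop only "failure of dependent device" events, keep scheduled monitoring events,
        and suspend the queue for a set period (or until the failed component recovers)."""
        device_queue["events"] = [
            e for e in device_queue["events"] if e.get("kind") != "dependent_failure"
        ]
        device_queue["suspended_until"] = time.time() + suspend_for_s
        return device_queue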
The present system and method may be conveniently implemented using one or more
conventional general purpose digital computers and/or servers programmed according to the
teachings of the present specification. Appropriate software coding can readily be prepared by
skilled programmers based on the teachings of the present disclosure. The present system and
method may also be implemented by the preparation of application specific integrated circuits
or by interconnecting an appropriate network of conventional component circuits.
Numerous additional modifications and variations of the present system and method are
possible in view of the above teachings. It is therefore to be understood that within the scope
of the appended claims, the present disclosure may be practiced other than as specifically
described herein.

Claims

WHAT IS CLAIMED IS:
1. A method of monitoring a network including a plurality of components, comprising:
providing a monitoring request to a component for monitoring the network for an
apparent failure of the component on the network; and
immediately monitoring availability of a chain of components between a failed
component and a monitoring system to establish which component in the chain is causing the
apparent failure in subsequent components.
2. A method of monitoring a network as recited in claim 1, further comprising classifying
the component in the chain causing the apparent failure in subsequent components as failed and
components reachable through the failed component as status unknown.
3. A method of monitoring a network as recited in claim 1, further comprising determining a
topology of the network.
4. A method of monitoring a network as recited in claim 3, wherein the topology of the
network is automatically determined using a network discovery mechanism.
5. A method of monitoring a network as recited in claim 3, wherein the topology of the
network is manually specified by a user.
6. A method of monitoring a network as recited in claim 1, the monitoring system
comprising a network monitoring application running on a network monitoring node.
7. A method of monitoring a network as recited in claim 6, wherein the network monitoring
application represents the network as a tree with the network monitoring node at a top of the tree.
8. A method of monitoring a network as recited in claim 7, wherein nodes in the tree
represent nodes in the network.
9. A method of monitoring a network as recited in claim 8, wherein branches in the tree
provide links between components, where one component is dependent on another for routing of
information.
10. A method of monitoring a network as recited in claim 1, wherein each component on the
network is periodically monitored at a defined frequency.
11. A method of monitoring a network as recited in claim 10, wherein if a component is
successfully monitored, a next monitoring time for the component is rescheduled based on the
defined frequency.
12. A method of monitoring a network as recited in claim 7, wherein when the component
fails the monitoring request, a high priority is placed on sending a monitoring request to a next component up the tree in a direction of the monitoring device.
13. A computer readable medium including code for monitoring a network including a plurality
of components, said computer readable medium comprising:
code for providing a monitoring request to a component for monitoring the network for
an apparent failure of the component on the network; and
code for immediately monitoring availability of a chain of components between a failed
component and a monitoring system to establish which component in the chain is causing the
apparent failure in subsequent components.
14. A computer readable medium as recited in claim 13, further comprising code for
classifying the component in the chain causing the apparent failure in subsequent components as
failed and components reachable through the failed component as status unknown.
15. A computer readable medium as recited in claim 13, further comprising code for
determining a topology of the network.
16. A computer readable medium as recited in claim 15, further comprising code for
automatically determining the topology of the network using a network discovery mechanism.
17. A computer readable medium as recited in claim 15, further comprising code for allowing
the topology of the network to be manually specified by a user.
18. A computer readable medium as recited in claim 13, wherein the computer readable
medium is capable of running on the monitoring system comprising a network monitoring
application running on a network monitoring node.
19. A computer readable medium as recited in claim 13, further comprising code for
periodically monitoring each component on the network at a defined frequency.
20. A computer readable medium as recited in claim 19, wherein if a component is
successfully monitored, a next monitoring time for the component is rescheduled based on the
defined frequency.
21. A monitoring system including a monitoring application for monitoring a network
including a plurality of components comprising:
a monitoring application portion for providing a monitoring request to a component for
monitoring the network for an apparent failure of the component on the network; and
a monitoring application portion for immediately monitoring availability of a chain of
components between a failed component and a monitoring system to establish which component
in the chain is causing the apparent failure in subsequent components.
22. A system as recited in claim 21, the application further classifying the component in the
chain causing the apparent failure in subsequent components as failed and components reachable
through the failed component as status unknown.
23. A system as recited in claim 21, wherein the application determines a topology of the
network.
24. A system as recited in claim 23, wherein the topology of the network is automatically
determined using a network discovery mechanism.
25. A system as recited in claim 23, wherein the topology of the network is manually
specified by a user.
PCT/US2000/035711 1999-12-30 2000-12-29 System and method for topology based monitoring of networking devices WO2001050266A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU26117/01A AU2611701A (en) 1999-12-30 2000-12-29 System and method for topology based monitoring of networking devices

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US17381699P 1999-12-30 1999-12-30
US60/173,816 1999-12-30

Publications (1)

Publication Number Publication Date
WO2001050266A1 true WO2001050266A1 (en) 2001-07-12

Family

ID=22633621

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2000/035711 WO2001050266A1 (en) 1999-12-30 2000-12-29 System and method for topology based monitoring of networking devices

Country Status (2)

Country Link
AU (1) AU2611701A (en)
WO (1) WO2001050266A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5218601A (en) * 1989-12-22 1993-06-08 Fujitsu Limited Method for searching for alternate path in communication network
US5093824A (en) * 1990-03-27 1992-03-03 Bell Communications Research, Inc. Distributed protocol for improving the survivability of telecommunications trunk networks
US5117430A (en) * 1991-02-08 1992-05-26 International Business Machines Corporation Apparatus and method for communicating between nodes in a network
US5710777A (en) * 1992-02-07 1998-01-20 Madge Networks Limited Communication system
US5864662A (en) * 1996-06-28 1999-01-26 Mci Communication Corporation System and method for reported root cause analysis

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1714222A2 (en) * 2003-10-31 2006-10-25 Seebyte Ltd Intelligent integrated diagnostics
EP1862897B1 (en) * 2006-05-29 2017-05-03 Canon Kabushiki Kaisha Information processing apparatus, printing system, and monitoring method

Also Published As

Publication number Publication date
AU2611701A (en) 2001-07-16

Similar Documents

Publication Publication Date Title
US7743274B2 (en) Administering correlated error logs in a computer system
US7281170B2 (en) Help desk systems and methods for use with communications networks
US7421695B2 (en) System and methodology for adaptive load balancing with behavior modification hints
US7630313B2 (en) Scheduled determination of network resource availability
US6535990B1 (en) Method and apparatus for providing fault-tolerant addresses for nodes in a clustered system
KR20040093441A (en) Method and apparatus for discovering network devices
JP2005524162A (en) System and method for dynamically changing connections in a data processing network
CN1507721A (en) Method and system for implementing a fast recovery process in a local area network
JP2006285377A (en) Failure monitoring program and load distribution device
JPH1168745A (en) System and method for managing network
US9485156B2 (en) Method and system for generic application liveliness monitoring for business resiliency
AU2001241700B2 (en) Multiple network fault tolerance via redundant network control
US20040003007A1 (en) Windows management instrument synchronized repository provider
US20030035408A1 (en) Redundant communication adapter system for connecting a client to an FDDI network
EP1370918B1 (en) Software-based fault tolerant networking using a single lan
AU2001241700A1 (en) Multiple network fault tolerance via redundant network control
US20050010929A1 (en) System and method for electronic event logging
JPH09319689A (en) Server selecting system
JP2000022783A (en) Method for extending ability of ping function in interconnection between open systems
WO2001050266A1 (en) System and method for topology based monitoring of networking devices
US8046471B2 (en) Regressive transport message delivery system and method
JP2006180223A (en) Communication system
JP2005136690A (en) High speed network address taking over method, network device and its program
JP2011035753A (en) Network management system
Goldman Network Communication

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP