SYSTEM AND METHOD FOR TOPOLOGY BASED MONITORING OF
NETWORKING DEVICES
Reference to Related Application
The present application claims the benefit of provisional Application Serial No.
60/173,816, filed December 30, 1999, which is hereby incorporated herein by reference.
BACKGROUND
1. Field of the Disclosure
The present disclosure relates generally to network monitoring, and in particular, to a
system and method for topology based monitoring of networking devices.
2. Description of The Related Art
In a networked environment, devices are typically connected on various segments of a
network. The various segments of the network may in turn be connected to one another using
some level of connectivity hardware, such as a hub, bridge or router. When a monitoring system
is inserted into the network to monitor the status of the devices connected to the network, a
failure of one of these connectivity devices can cause multiple device failure messages for all
devices on the other side of the connectivity hardware, rather than specifically identifying the
connectivity device that failed. In addition, a failure of a device connected between the
monitoring system and devices on the other side of the failed device will also cause multiple
device failure messages, rather than specifically identifying the device that failed.
An example of a typical network arrangement is shown in Fig. 6. The representation
shown in Fig. 6 depicts a network tree 201 which shows the interconnectivity of the devices on
the network. The network includes a monitoring device 200, monitored devices 202, 204 and
router 206 connected to a hub 208. Router 206 is also connected to hub 210, along with
monitored devices 212-216. Monitoring device 200 is capable of actively monitoring and
actively verifying the availability of the devices on the network, using any one of a plurality of
known network diagnostic or network application requests.
In the network arrangement shown in Fig. 6, if a catastrophic failure such as a power
outage, for example, should occur in hub 208, which renders it unable to transport network
requests from monitoring device 200 to monitored devices 202, 204 and router 206, for example,
all requests to these devices will fail. In addition, all requests to all devices on the other side of
router 206 (e.g., hub 210, and monitored devices 212-216) will also fail. Monitoring device 200
will see this as a lack of availability of all devices on the network, and report all devices as
having failed. Ideally, however, monitoring device 200 would be able to quickly identify and
report hub 208 as a failed device and report the other devices on the network as unavailable. In
another example, if router 206 were to encounter a catastrophic failure instead of hub 208, all
requests to router 206, hub 210 and monitored devices 212-216 would fail. Ideally monitoring
device 200 would be able to quickly identify and report router 206 as having failed and identify
and report hub 210 and any devices connected thereto (monitored devices 212-216) as
unavailable.
One possible method of identifying the failed device on the network may include
performing an analysis using an event correlation type system which attempts to identify the
specific failed component within the network after receiving and correlating failure messages
from multiple networked devices. However, it is often difficult to achieve this result effectively
and efficiently, as the correlation typically requires a considerable amount of time and processing.
SUMMARY
The present disclosure relates to a method and system for monitoring a network
including a plurality of components. The method and system comprises providing a monitoring
request to a component for monitoring the network for an apparent failure of the component on
the network, and immediately monitoring availability of a chain of components between a failed
component and a monitoring system to establish which component in the chain is causing the
apparent failure in subsequent components.
The method and system may further comprise classifying the component in the chain
causing the apparent failure in subsequent components as failed and components reachable
through the failed component as status unknown. The method and system may further comprise
determining a topology of the network, the topology of the network being determined
automatically using a network discovery mechanism, or manually specified by a user.
The method and system may comprise a network monitoring application running on a
network monitoring node. The network monitoring application may represent the network as a
tree with the network monitoring node at a top of the tree.
BRIEF DESCRIPTION OF THE DRAWINGS
The above and other features of the disclosure will be apparent upon consideration of the
following detailed description, when read in conjunction with the accompanying drawings, in
which like reference characters refer to like parts throughout and in which:
Figure 1 shows an example of a network used for explaining embodiments of the present
disclosure;
Figure 2 is a block diagram of a monitoring system according to an embodiment;
Figure 3 is a flow chart representing the steps carried out during monitoring according to
an embodiment;
Figures 4A and 4B show a tree structure and diagrammatic illustration, respectively, of
the network configuration shown in Fig. 1 ;
Figures 5A - 5E show events in various event queues at specific points in time for
explaining an example of the present system and method;
Figure 6 depicts a network configuration for explaining network monitoring; and
Figure 7 is a block diagram depicting exemplary components capable of being monitored
in a monitored device.
DETAILED DESCRIPTION
In describing the preferred embodiments illustrated in the drawings, specific
terminology is employed for sake of clarity. However, the disclosed embodiments are not
intended to be limited to the specific terminology so selected and it is to be understood that
each specific element includes all technical equivalents which operate in a similar manner.
Fig. 1 depicts an exemplary network 1 to which the present system and method may be
applied. For example, in this embodiment, a monitoring system 20 may be connected to devices
such as a network printer 10 and a computer workstation 12, via hub 26. Workstation 14, server
16 and network printer 18 are connected to each other via hub 24. Devices on hub 24 and
devices on hub 26 are capable of communicating via router 30. The devices on network 1 are
described above as being "connected". This refers to the ability of the devices to communicate
or otherwise interface to the network via links 4 and not necessarily to a physical connection. Of
course, other types of devices may be provided on the network, including, for example, network
facsimile devices, other servers, workstations, printers, hubs, routers, etc.
Applications 22 running on monitoring system 20 can communicate with these other
devices on hub 26 and hub 24 via suitable interfaces and protocols based on the operating
systems on the network and the network architecture used. For example, an exemplary network
to which the present system and method can be applied might be an un-bridged local area
network, such as an ethernet or token ring network. The network could be a bridged ethernet or
token ring network, or a combination of token ring and ethernet networks connected by one or
more routers or switches. The network could also be a Wide Area Network (WAN) including the
Internet or an intranet system, for example, using the TCP/IP protocol to communicate and
connect with other local area networks. An embodiment of a monitoring system according to the
present disclosure is capable of operating on a network 1 as shown in Figure 1 in which two
ethernet LANs (LAN 1 and LAN 2) in a star configuration are linked together by router 30.
The components communicating over the network use low level protocols dependent on
the transport medium used and higher level protocols dependent on the operating systems and
applications executing on the linked network components. A discussion regarding details of the
communication between the network components is not necessary for an understanding of the
present disclosure. It is assumed that the monitoring application 22 or another application
executing on the monitoring device 20 is provided with the functionality to communicate with
the other specific network components on network 1 and to thereby monitor the components and
establish whether they are operating appropriately. Of course, the monitoring application 22 can
be provided on one of the other devices shown in Fig. 1 or on a device remote to hubs 24, 26 and
used to monitor the hubs themselves and one or more of the other devices on the hubs. A
monitoring device may itself be monitored by another device running a monitoring application.
It is assumed hereinafter that suitable protocols and network operating systems are in
place and that the interfaces, files, ports and other components of the remote devices can be
accessed through application program interfaces (APIs) or appropriate remote procedure calls
(RPCs). Such arrangements are very well known and utilized on most networked devices today.
Monitoring system 20 may be a standard PC, laptop, mainframe, etc. capable of running a
monitoring application according to an embodiment described herein. Fig. 2 depicts a block
diagram of exemplary elements monitoring system 20 may include. Of course, monitoring
system 20 may not include each element shown and/or may include additional elements not
shown. As shown, monitoring system 20 may include a central processing unit (CPU) 62, a
memory 64, a clock circuit 66, a printer interface 68, a display unit 70, a LAN data transmission
controller 74, a LAN interface 76, a network controller 78, an internal bus 80 and one or more
input devices 72 such as, for example, a keyboard and mouse.
CPU 62 controls the operation of system 20 and is capable of running applications stored
in memory 64. Memory 64 may include, for example, RAM, ROM, removable CDROM, DVD,
etc. Memory 64 may also store various types of data necessary for the execution of the
applications, as well as a work area reserved for use by CPU 62. Clock circuit 66 may include a
circuit, for example, for generating information indicating the present time.
The LAN interface 76 allows communication between the ethernet LAN 1 , and the LAN
data transmission controller 74. The LAN data transmission controller 74 uses a predetermined
protocol suite to exchange information and data with the other devices on the network 1.
Monitoring system 20 may also be capable of communicating with and/or monitoring the devices
on LAN 2 via router 30 and/or on other networks. System 20 may also be capable of
communicating with other devices via a Public Switched Telephone Network (PSTN) using
network controller 78. Internal bus 80, which may actually consist of a plurality of buses, allows
communication between each of the components connected thereto.
Each of the devices on network 1 may be capable of being monitored. Accordingly, as
depicted in Fig. 1 , network printer 10, workstation 12, workstation 14, server 16 and network
printer 18 are identified as monitored devices 1-5, respectively, as shown in Fig. 1 . Although not
specifically identified as monitored devices, hubs 24, 26 and router 30 are also capable of being
monitored. Of course, additional devices, whether capable of being monitored or not, may be
provided on network 1.
Each of the devices on network 1 may include one or more of the components shown in
Fig. 7, each of which is capable of being monitored by system 20. The definition of what
constitutes a component as used in this description is a matter of choice depending on the
implementation. However, the following are examples of the types of components that may be
monitored and the present embodiments envision the monitoring of any component capable of
being monitored. As shown in Fig. 7, a device may include one or more system event logs 40,
one or more system error log files 42, one or more application log files 44, TCP/IP port
(connectivity) information 46, services/processes running status information 48, physical
performance data 50 and/or physical network interface (availability) information 52.
The monitoring application 22 run by monitoring system 20 is capable of monitoring one
or more of the components shown in Fig. 7 for each of the devices on network 1 . The
monitoring application 22 maintains a set of queues, each queue corresponding to one of the
devices being monitored. The queue structures will be described in more detail later below.
Briefly, however, each queue includes one or more monitoring events, each monitoring event in
each queue containing an instruction for monitoring a specific component (see Fig. 7) on the
associated monitored device. Further, the queues could alternatively be combined into a single
queue identifying the component associated with each event. Once a monitoring event in the
queue is triggered, an independently executing task or thread is generated which contains
instructions to wait for a response to the monitoring event and then process the results in a
manner consistent with the results. If an error message or response is received from the device or
if no response is received within a set amount of time, the monitoring event for that device is
deemed failed.
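For purposes of illustration only, the per-device queue and the timeout behavior described above might be sketched in Python as follows; the class and function names are assumptions for illustration and are not part of the disclosed system, and the actual network probe is abstracted into a callable.

```python
import threading


class DeviceQueue:
    """Per-device queue of monitoring events (illustrative names)."""

    def __init__(self, device_name):
        self.device_name = device_name
        self.events = []  # scheduled monitoring events and any
                          # "failure of dependent device" notices

    def add_event(self, event):
        self.events.append(event)

    def purge(self):
        self.events.clear()


def run_monitoring_event(probe, timeout=5.0):
    """Execute one monitoring event in its own thread; the event is
    deemed failed if the probe raises an error, reports failure, or no
    response is received within the set amount of time."""
    result = {"ok": False}

    def worker():
        try:
            result["ok"] = bool(probe())  # e.g. a ping or port check
        except Exception:
            result["ok"] = False

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout)
    if t.is_alive():          # no response within the time limit
        return False
    return result["ok"]
```

A probe that raises, returns a false value, or blocks past the timeout all map to the same "monitoring event failed" outcome described above.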
Monitoring device 20 has access to information showing the structure or configuration of
the network. This information may be in the form of, for example, a tree structure or network
topology. The network configuration information identifies the devices on the network and how
the devices are arranged on the network. The network configuration information may be
maintained in memory 64 of monitoring device 20, or otherwise accessible by the monitoring
application 22. The network configuration information may include a set of structures or objects
each representing a device on the network, and a linked list of pointers to other devices. The
pointers may include pointers to parent devices. The network configuration information may, for
purposes of illustration, be represented as a tree structure, as shown in Fig. 4A. In the
alternative, the network configuration information may, for purposes of illustration, be
represented diagrammatically as shown in Fig. 4B. As shown in Fig. 4B, for each device, its
parent or dependent device exists above and at one level less of indentation. In this example,
monitoring device 20 and hub 26 do not have any parent or dependent device. The network
configuration information may be obtained automatically using a network discovery system or
may be manually specified by a user.
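For purposes of illustration, the structures with parent pointers described above might be sketched in Python as follows, reconstructing the topology of Figs. 4A and 4B; the class name and the device variable names are illustrative only.

```python
class NetworkNode:
    """One entry in the network configuration information; each node
    keeps a pointer to its parent device and a list of dependents."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent       # pointer to the parent device, or None
        self.children = []         # devices reachable through this one
        if parent is not None:
            parent.children.append(self)


# Rebuild the tree of Figs. 4A/4B: hub 26 has no parent device;
# all other monitored devices hang off hub 26 directly or indirectly.
hub26 = NetworkNode("hub 26")
md1 = NetworkNode("monitored device 1", hub26)
md2 = NetworkNode("monitored device 2", hub26)
router30 = NetworkNode("router 30", hub26)
hub24 = NetworkNode("hub 24", router30)
md3 = NetworkNode("monitored device 3", hub24)
md4 = NetworkNode("monitored device 4", hub24)
md5 = NetworkNode("monitored device 5", hub24)
```

Following a chain of `parent` pointers from any node leads to the monitoring system's side of the network, which is the property the Fig. 3 procedure relies on.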
As shown in Fig. 4A, the network configuration information represents the route taken
from device to device by data transmitted to and from the monitoring device. Stated another
way, a first device can be said to be linked to a second device one level higher in the tree if: a
failure of the second device would prevent communication between the monitoring device and
the first device; and there are no further monitored devices between the first and second devices
which are instrumental in communication of data between the first device and the monitoring
device.
For example, with the network configuration shown in Figure 4A, a failure in monitored
device 1, monitored device 2 or router 30 will not affect any other device directly connected to
hub 26. However, a failure of hub 26 itself may cause the entire section of the network (e.g., all
devices connected to hub 26) to cease operating. Thus, monitored devices 1, 2 and router 30 can
be said to be independent, but are dependent on hub 26 to communicate with each other and the
monitoring system 20.
The network configuration information can be automatically established by monitoring
system 20 by suitably polling the devices on network 1 in a known manner. The network
configuration information need not be part of a larger data structure, such as a detailed
representation of the network configuration, as long as devices instrumental in transmitting data
between nodes can be ascertained from the structure.
Monitoring application 22 is configured to periodically monitor each device on network 1
at scheduled intervals. Following successful monitoring of a device, the next monitoring time
for that device is rescheduled based on a set monitoring interval. An example of a monitoring
schedule table is shown in Figure 5A. Figure 5A shows all device queues merged into a single
table for clarity in discussion. Of course, each queue for each device may actually be arranged
separately. The schedule may be laid out as a table including a "Queue For" column 31, an
"Event" column 32 and a "Monitoring Scheduled" column 33. Column 31 identifies the device
queue. Column 32 describes the event taking place for the device identified in column 31,
indicating, for example, either that the device is scheduled to be monitored or identifying a
device that failed during a previous monitoring event. Column 33 indicates the time that the
event occurred.
Fig. 3 is a flowchart for describing the processing of a monitoring event according to an
embodiment. Initially, in Step S2, monitoring of the first device in the queue is started. A
determination whether the monitoring event for the selected device was successful is then made
(Step S4). For example, a monitoring event may include monitoring TCP/IP port (connectivity)
information 46 or services/processes running status information 48 (Fig. 7) of monitored device 3
(workstation 14). If the monitoring event was successful (Yes, Step S4), a determination is made in Step S6 whether
monitored device 3's queue includes a previous device failure indication. If Yes in Step S6, the
failure of the previous device indicated as having failed is processed (Step S7). If No in Step S6,
the next device to be monitored is selected (Step S5), monitoring is started (Step S2) and the
process returns to Step S4 for determining whether the monitoring event for the device was
successful. If the monitoring event was not successful (No, Step S4), the event queue for the
device being monitored is purged (Step S8). A determination is then made whether the device
has a parent, by reference to the network configuration information (e.g., see Figs. 4A, 4B). In
Fig. 4A, a parent device may be considered a device on a level immediately "above" a device in
the tree structure and having a link thereto. In Fig. 4B, a parent device exists above and at one
level less of indentation. If the device does not have a parent (No, Step S10), the failure of the
device just monitored is processed (Step S12). If the device has a parent (Yes, Step S10), the
parent device is rescheduled for immediate monitoring (Step S14), and information identifying
the failure of the child device is added to the parent device's queue (Step S16) and monitoring is
started (Step S2). The procedure then returns to Step S4 and monitoring of the parent device is
performed in a similar manner. This process continues until a device is successfully monitored
(Yes, Step S4) and its event queue includes a child device's failure indication (Yes, Step S6) or
until it is determined that a failed device does not have a parent (No, Step S10). The reference to
scheduling a device for "immediate monitoring" refers to placing the device at the head of the
queue table to be monitored next, and not necessarily to the exact time frame for which the
monitoring event is to occur.
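For purposes of illustration, the escalation loop of Steps S8 through S16 might be sketched in simplified Python as follows; the class and function names are assumptions for illustration, and the monitoring probe of Steps S2/S4 is abstracted into an `is_reachable` callback.

```python
class Node:
    """Minimal device record with a parent pointer (illustrative)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


def find_failed_component(start, queues, is_reachable):
    """Given a device whose monitoring event just failed (No, Step S4),
    walk the parent chain until the component causing the apparent
    failure is identified."""
    current = start
    while True:
        queues[current.name].clear()          # Step S8: purge the queue
        parent = current.parent
        if parent is None:                    # No, Step S10
            return current                    # Step S12: current failed
        # Step S16: record the child's failure in the parent's queue
        queues[parent.name].append("Failure of " + current.name)
        if is_reachable(parent):              # Steps S14, S2, Yes in S4
            return current                    # Step S7: child is the cause
        current = parent                      # escalate to the parent
```

With hub 26 failed so that nothing responds, starting the walk from monitored device 3 ends at hub 26, the device with no parent; with only hub 24 failed, the walk stops as soon as router 30 answers, leaving the "Failure of hub 24" notice in router 30's queue for processing.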
The processing performed in Fig. 3 will now be further explained by reference to an
example, using the network tree configuration information as shown in Figs. 4A and 4B. In this
example, it is assumed that hub 26 availability has been verified, and hub 26 has been
rescheduled for monitoring at a later time. It is further assumed that this is followed by a
catastrophic failure of hub 26. Fig. 5A depicts the state of the monitoring queue at this point in
time 34.
In this example, as shown in Fig. 5A, monitored device 3 (workstation 14) is the next
device in the queue scheduled to be monitored. Monitoring of monitored device 3 begins (Step
S2). In Step S4, a determination is made whether the monitoring event for monitored device 3
was successful. If there had not been a catastrophic failure of hub 26, the monitoring event
would have been successful (Yes, Step S4) in this example, and since the event queue for
monitored device 3 would not include a child device's failure indication (No, Step S6), the next
device in the queue would have been selected to be monitored (Step S5). However, in this
example, this monitoring request fails due to the catastrophic failure of hub 26 (No, Step S4).
Accordingly, the event queue of monitored device 3 is purged (Step S8), since monitored device
3 is unreachable. A determination is then made in Step S10 whether monitored device 3 has a
parent, by referring to the network configuration as shown in Figs. 4A, 4B. If there were no
parent device (No, Step S10), the failure of monitored device 3 would be processed (Step S12).
Processing of the failure may include notification of the failure of monitored device 3 to a user
via a monitor and/or storage of information identifying the failure for future reference. However,
in this case, since monitored device 3 has a parent device (hub 24) (Yes, Step S10), hub 24 is
then rescheduled for immediate monitoring (Step S14). "Failure of MD3" is then placed into the
event queue for hub 24 (Step S16), identifying that monitored device 3 failed. At this point in
time, the monitoring event queue is as shown in Fig. 5B. As shown, hub 24 is the next device to
be monitored and hub 24's queue indicates that monitored device 3 failed. Monitoring of hub 24
is then started (Step S2) and the process repeats.
If the monitoring of hub 24 had succeeded (Yes, Step S4), this would indicate that hub 24
was operating properly, and looking at hub 24's event queue (Step S6) would indicate that
monitored device 3 had failed. The "Failure of MD3" event placed in the queue of hub 24 would then be
processed (Step S7). However, in this example, the monitoring of hub 24 fails (No, Step S4).
Accordingly, the event queue for hub 24 is purged (Step S8), to remove the "Failure of MD3"
event from its queue. A determination is then made in Step S10 whether hub 24 has a parent
device by reference to the network configuration. If hub 24 did not have a parent device (No,
Step S10), the failure of hub 24 would be processed so that notification of the failure of hub 24
would be provided. However, in this example, since hub 24 has a parent (router 30) (Yes,
Step S10), router 30 is rescheduled for immediate monitoring (Step S14). The "Failure of hub
24" is then placed into the event queue for router 30 (Step S16). Fig. 5C depicts the monitoring
event queue at this point in time. Monitoring of router 30 is then started (Step S2).
If the monitoring of router 30 succeeded (Yes, Step S4), this would indicate that router
30 was operating properly and by referring to router 30's event queue (Step S6) it could be
determined that hub 24 had failed. The "Failure of hub 24" event would then be processed (Step
S7). However, in this example, the monitoring of router 30 fails (No, Step S4). Accordingly, the
event queue for router 30 is purged (Step S8), to remove the "Failure of hub 24" event from its
queue. A determination is then made in Step S10 whether router 30 has a parent device by
reference to the network configuration information. If router 30 did not have a parent device
(No, Step S10), the failure of router 30 would be processed so that notification of the failure of
router 30 would be provided. However, in this example, since router 30 has a parent (hub 26)
(Yes, Step S10), hub 26 is rescheduled for immediate monitoring (Step S14). The "Failure of
router 30" is then placed into the event queue for hub 26 (Step S16). Fig. 5D depicts the monitoring
event queue at this point in time. Monitoring of hub 26 is then begun (Step S2).
If the monitoring of hub 26 succeeded (Yes, Step S4), by referring to its event queue (Step
S6) it could be determined that router 30 had failed. The "Failure of Router 30" would then be
processed (Step S7). However, in this example, the monitoring of hub 26 fails (No, Step S4) and
its event queue is thus purged (Step S8), to remove the "Failure of Router 30" event from its
queue. A determination is then made in Step S10 whether hub 26 has a parent device by
reference to the network configuration. In this example, since hub 26 does not have any parent
device (No, Step S10), the failure of the current device (hub 26) is processed (Step S12).
At this point, hub 26 would be classified as a failed component, and the availability for
devices dependent on hub 26 would be classified as having unknown status. In this topology, the
devices classified as status unknown would be monitored device 1, monitored device 2,
monitored device 3, monitored device 4, monitored device 5, router 30 and hub 24. This can
easily be accomplished by suitably recursively navigating the tree. In the alternative, a list of
failed devices can be created as they are determined and then classified as unknown status as
appropriate. In addition, it may be desirable to purge all the queues for the devices downstream
of hub 26 to ready for the next monitoring event (see Fig. 5E).
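For purposes of illustration, the recursive navigation of the tree described above might be sketched in Python as follows; the class name, function name and status labels are illustrative.

```python
class Node:
    """Minimal device record for the Fig. 4A tree (illustrative)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.children = []
        if parent is not None:
            parent.children.append(self)


def classify_unreachable(failed, status):
    """Mark the failed component as "failed" and recursively classify
    every device reachable only through it as status "unknown"."""
    status[failed.name] = "failed"

    def mark(node):
        for child in node.children:
            status[child.name] = "unknown"
            mark(child)

    mark(failed)
    return status
```

Applied to the Fig. 4A topology with hub 26 failed, this marks the seven devices dependent on hub 26 (monitored devices 1 through 5, router 30 and hub 24) as status unknown.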
In an alternative embodiment, instead of purging all events in a queue when a device is
known to be unreachable, any "failure of dependent device" event in the queue can be deleted
from the queue and the queue can be suspended. This allows scheduled events associated with
unreachable components to be maintained and possibly executed when the failed component is
functioning correctly again. For example, instead of completely purging the device's queue in
Step S8 of Fig. 3, only the "failure of dependent device" portion of the queue can be purged, thus
maintaining future scheduled monitoring events for the device in the device's queue. The
device's queue can then be suspended for a set period of time or until it is determined that the
failed component is functioning properly.
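For purposes of illustration, the selective purge of this alternative embodiment might be sketched as follows; the event-naming convention used to recognize "failure of dependent device" notices is an assumption for illustration.

```python
def purge_dependent_failures(events):
    """Delete only the "failure of dependent device" notices from a
    device's queue, keeping future scheduled monitoring events so they
    can run once the failed component is functioning correctly again."""
    return [e for e in events if not e.startswith("Failure of")]
```

The filtered queue would then be held in a suspended state for a set period of time, or until the failed component is verified as functioning properly.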
The present system and method may be conveniently implemented using one or more
conventional general purpose digital computers and/or servers programmed according to the
teachings of the present specification. Appropriate software coding can readily be prepared by
skilled programmers based on the teachings of the present disclosure. The present system and
method may also be implemented by the preparation of application specific integrated circuits
or by interconnecting an appropriate network of conventional component circuits.
Numerous additional modifications and variations of the present system and method are
possible in view of the above teachings. It is therefore to be understood that within the scope
of the appended claims, the present disclosure may be practiced other than as specifically
described herein.