GB2421656A - Distributed network fault analysis - Google Patents

Distributed network fault analysis

Info

Publication number
GB2421656A
Authority
GB
United Kingdom
Prior art keywords
network
alarm
fault
root cause
management function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB0428189A
Other versions
GB0428189D0 (en)
Inventor
Paul Kettlewell
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nortel Networks Ltd
Original Assignee
Nortel Networks Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nortel Networks Ltd filed Critical Nortel Networks Ltd
Priority to GB0428189A priority Critical patent/GB2421656A/en
Publication of GB0428189D0 publication Critical patent/GB0428189D0/en
Publication of GB2421656A publication Critical patent/GB2421656A/en

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 - Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/16 - Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks using machine learning or artificial intelligence
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 - Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 - Management of faults, events, alarms or notifications
    • H04L41/0631 - Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H04L41/065 - Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis involving logical or physical relationship, e.g. grouping and hierarchies

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

A communications network comprises network elements which each have an alarm processing function. Each alarm processing function has control logic which receives a trigger (e.g. an internal alarm) from the element which is indicative of a fault in the operation of the element. The control logic determines the root cause of the fault and issues an alarm externally to the element which identifies the root cause of the fault. The control logic within each alarm processing function is configurable by a network management entity external to the element. The control logic can be implemented as a set of state machines which operate in parallel with one another. Each state machine responds to a set of triggers. Preferably, only one state machine is permitted to issue an alarm external to the element for each root cause.

Description

DISTRIBUTED NETWORK FAULT ANALYSIS
FIELD OF THE INVENTION
This invention relates to processing of faults and alarms within a communications network.
BACKGROUND TO THE INVENTION
One of the functions for maintaining reliable operation of a telecommunications network is to detect when faults occur at equipment within the network and to ensure that traffic is not disrupted for any longer than is necessary. Figure 1 shows a traditional model for raising and responding to alarms within a network, which is based on the International Telecommunication Union (ITU) Telecommunication Management Network (TMN) model described in ITU-T M.3400. At the lowest level are network elements 11, 12, 13 such as transceivers, switches and routers. When a fault occurs at an element an alarm is raised. The alarm can include an attribute which indicates the severity of the alarm, which ranges from 'critical', meaning it must be acted upon immediately, through 'major', 'minor' and 'warning'. The alarms 15 are passed through an Element Manager 20, which typically provides little extra value other than to provide visibility of the alarm. Alarms are then passed to the Network Management Level (NML), where a network management system 30 aggregates alarms from all of the elements 11, 12, 13 in the network.
Tools at the management system 30 typically provide an operator with a visual report of the alarms. This may include a banner showing the total number of alarms of each level of severity and the capability to browse lists of active and historical alarms.
A fault at a network element may give rise to a large number of alarms.
Consider an example where two network elements 11, 12 are connected by a physical cable. The cable is disconnected from one of the bays of the network element. The driver software that interfaces with the cable will warn that it is having write errors and subsequently may time out and generate a minor warning. Software at the link level will warn about corrupt information and will begin to time out. Buffers holding unsent data may also begin to overflow, generating alarms. The applications that are communicating via the cable will also raise alarms. An operator of the network management system will receive a flood of alarms 15 which all relate to the two elements and a set of applications.
Alarms can be interpreted by an experienced human operator or by an automated expert system 32 at the network management system 30 which includes a set of rules and which models behaviour of the network. A system of this type is described in US 6,598,033. The rules can help an operator to identify the root cause of a fault by correlating alarms. Although these systems can help to determine the root cause of a fault, they do not reduce the high volume of alarm traffic carried by the network. Even with sophisticated tools at the network management system, it can be difficult to identify the root cause of a fault due to a flood of alarms.
It is desirable to reduce the number of alarms at the network management layer and to improve the quality of information sent to the network management layer.
SUMMARY OF THE INVENTION
A first aspect of the present invention provides a management function for a network element of a communications network comprising control logic which is operable to: receive at least one trigger from the element which is indicative of a fault in the operation of the element; determine the root cause of the fault within the element; and, issue an alarm externally to the element which identifies the root cause of the fault, and wherein the control logic is configurable by a network management entity external to the element.
Because a network element is attempting to find a root cause of a fault, the quality of information propagated to the network management system is improved.
This allows an operator of a network management system to more readily identify a network fault and to respond more quickly with remedial action. As the management function can be configured by a network management entity external to the element, an operator can specify rules that are particular to their type of network and equipment. A network element can be loaded with control logic (rules) that are specific to that type of element, or the network topology in which the element is located. Rules can be updated as often as is necessary by the external management entity so as to adapt to changes in the equipment at the element and changes to network topology. This can allow the reasoning capability at the network element to be at least as good as that used at the network manager.
Preferably, unnecessary alarms are suppressed at each network element and are not allowed to propagate onto the network. The suppression of alarms at network elements reduces the amount of alarm traffic carried over the network, which frees resources for other uses. This also significantly reduces the total number of alarms that are presented to an operator. However, when an alarm is raised by an element which exceeds a certain level of severity it can be desirable to allow that alarm to be issued to the network manager.
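By way of illustration only, the following minimal sketch (in Python, used here purely for illustration) shows one way such a severity gate might be expressed. The Severity ordering, the FORWARD_THRESHOLD policy and the function names are assumptions made for the example, not part of the disclosure.

```python
from enum import IntEnum

class Severity(IntEnum):
    WARNING = 0
    MINOR = 1
    MAJOR = 2
    CRITICAL = 3

# Assumed policy: only critical alarms bypass local suppression.
FORWARD_THRESHOLD = Severity.CRITICAL

def handle_internal_alarm(severity: Severity, forward, analyse) -> None:
    """Forward sufficiently severe alarms at once; otherwise keep the
    alarm local and let root-cause analysis decide what, if anything,
    is issued externally."""
    if severity >= FORWARD_THRESHOLD:
        forward()  # high-severity alarm goes straight to the network manager
    analyse()      # root-cause analysis runs in every case

handle_internal_alarm(
    Severity.MINOR,
    forward=lambda: print("alarm forwarded to network manager"),
    analyse=lambda: print("root-cause analysis started"),
)
```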
Preferably, the management function at each network element can initiate diagnostic operations to determine the root cause of a fault. The initial trigger that is presented to the management function may be indicative of several possible root causes. By performing diagnostic operations, the element can identify which root cause is most likely to apply.
The functionality described here is most advantageously implemented in software executed by a processor at a network element. Accordingly, another aspect of the invention provides software for implementing the method. The software can be stored on an electronic memory device, hard disk, optical disk or other machine-readable storage medium at the element. The software may initially be delivered to the network element as a computer program product on a machine-readable carrier or it may be downloaded directly to the network element via a network connection to the network management system. The set of rules can be initially loaded, and subsequently updated, by a network connection to a network management entity. The invention can alternatively be implemented by means of hardware.
BRIEF DESCRIPTION OF THE DRAWINGS
Embodiments of the invention will be described, by way of example only, with reference to the accompanying drawings in which:
Figure 1 schematically shows the handling of a fault in a conventional communications network;
Figure 2 schematically shows the handling of a fault in a network according to an embodiment of the invention;
Figure 3 shows an alarm processor within a network element of the network of Figure 2;
Figure 4 shows the protocol layers of two network elements of the network of Figure 2;
Figure 5 shows an example of a state machine for determining a root cause of a fault implemented by the alarm processor of Figure 3; and
Figure 6 shows a network management entity which distributes rules to network elements.
DESCRIPTION OF PREFERRED EMBODIMENTS
Figure 2 shows a communications network in accordance with an embodiment of the present invention. As in Figure 1, the network comprises a plurality of network elements 11, 12, 13 which are connected to each other to provide transport of communications traffic. The network elements 11-13 can include any network element which processes or passes traffic, or provides a service to users of the network, such as (but not limited to) terminals, transceivers, multiplexers, amplifiers, gateways, routers, switches, call processors or support nodes. The network can be packet or circuit-switched, and can use one or more of electrical, optical or wireless technologies. Each network element includes a function which will be called an Alarm Processor 14. The Alarm Processor 14 incorporates similar reasoning capabilities as are conventionally performed by the expert system 32. In operation, a network element hosting the Alarm Processor will generate alarms when faults occur. Various other status information can be generated by the element which can help diagnose the cause of a fault. The Alarm Processor 14 subscribes to alarms and other 'triggers' that are generated by the host network element. The Alarm Processor 14 uses reasoning capabilities to determine the root cause of a fault based on the triggers. The Alarm Processor at each element 11, 12, 13 includes rules specific to that type of element, and the network topology surrounding that element. In the simple example of Figure 2, element 11 knows that it is connected to elements 12 and 13. The AP 14 follows the rule(s) and generates an alarm message 16 which indicates the cause of the fault. This is received by the network management system 30 and acted upon. As the message 16 now indicates the cause of a fault, rather than simply a symptom, it is possible for the network manager to respond more quickly to the fault, taking corrective action to redirect traffic and to schedule maintenance which will correct the fault at the element 11.
Figure 3 shows the Alarm Processor within a network element in more detail and Figure 5 shows an example of a state machine implemented by the Alarm Processor.
The main control logic within the alarm processor is a set of rules 82. Each rule 82 provides reasoning in the form of a state machine 80 which attempts to determine the root cause of a fault. There is a rule 82 and an associated state machine 80 for each pattern of behaviour that may arise within that network element. A connection database 83 stores the location and details of all the other Alarm Processors within the network that it needs to collaborate with. This can also include information about which other network elements are connected to the network element 11, the number and type of circuits connecting to the adjoining elements, address information etc. As will be more fully described, the rules 82 and connection database 83 can be updated by an external entity 350.
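A minimal sketch, assuming a Python implementation, of how the rules 82 and the connection database 83 might be held together follows. All class and field names are illustrative assumptions rather than anything prescribed by the text.

```python
from dataclasses import dataclass, field

@dataclass
class Neighbour:
    """One entry in the connection database 83."""
    element_id: str    # identity of an adjoining element
    address: str       # how to reach its Alarm Processor
    circuits: int = 1  # number of circuits connecting to it

@dataclass
class Rule:
    """One rule 82; its state machine 80 is supplied separately."""
    name: str
    triggers: frozenset  # the triggers this rule subscribes to

@dataclass
class AlarmProcessor:
    rules: list = field(default_factory=list)
    connections: dict = field(default_factory=dict)  # element_id -> Neighbour

    def update(self, rules, connections) -> None:
        # Called when the external entity 350 pushes new rules or
        # topology information (message 72 in Figure 3).
        self.rules = list(rules)
        self.connections = dict(connections)

ap = AlarmProcessor()
ap.update([Rule("comm-failure", frozenset({"app-A-to-app-B-failure"}))],
          {"elem-12": Neighbour("elem-12", "10.0.0.12")})
```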
The Alarm Processor function 14 has four main interfaces 51, 61, 71, 81.
Interface 81 is to the host processor of the element, and allows the host processor to execute the code which supports the alarm processor function 14. If a dedicated processor executes the alarm processor function 14 then interface 81 is not required. Interface 51 allows the alarm processor 14 to interact with applications 50 running on the host network element. Typically, a host will run applications such as call processing, routing, translations and resource allocation. The alarm processor can access all or only a selection of the applications 50, depending on whether the applications are likely to indicate the root cause of a fault. The alarm processor 14 (or more accurately, the state machines 80) subscribes 53 to receive notification of events and alarms generated by applications 50. When a requested event occurs at one of the applications 50, an event message or alarm 52 is passed to a respective state machine within the alarm processor 14. The alarm processor can also initiate the running of diagnostic operations 54 on applications 50 and receive results of those diagnostic operations across interface 51. Finally, the alarm processor 14 can inspect state 55 of an application at any time by passing messages across interface 51.
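The subscribe/diagnose/inspect pattern on interface 51 might be sketched as follows. The Application class and its method names are assumptions made for illustration, not the patented implementation; the numbered comments map the methods onto messages 52-55 of Figure 3.

```python
from collections import defaultdict

class Application:
    """Stands in for a host application such as call processing."""

    def __init__(self, name: str):
        self.name = name
        self._subscribers = defaultdict(list)
        self._state = "running"

    def subscribe(self, event: str, callback) -> None:
        """Subscription request 53 from a state machine."""
        self._subscribers[event].append(callback)

    def raise_event(self, event: str) -> None:
        """Delivers event message or alarm 52 to each subscriber."""
        for callback in self._subscribers[event]:
            callback(event)

    def run_diagnostic(self, test: str) -> bool:
        """Diagnostic operation 54; a stub result here."""
        return True

    def inspect_state(self) -> str:
        """State inspection 55."""
        return self._state

app = Application("call processing")
app.subscribe("comm-failure", lambda ev: print(f"trigger received: {ev}"))
app.raise_event("comm-failure")
```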
Interface 61 connects alarm processor 14 to other elements within the network.
A state machine 80 generates a query 62 and receives a response 63 (or possibly no response) in reply. The query 62 can be a request to run a diagnostic operation or to check the state of an application running on another element in the same manner as for local applications 50 across interface 51. This interface can also carry triggers to APs on other elements, which will be monitored by state machines on those elements. This interface can also be used to receive requests, from an Alarm Processor on another element, to perform diagnostics on element 11. Connection database 83 stores information about the location of neighbouring Alarm Processors and elements that the Alarm Processor may need to collaborate with, and how to communicate with them.
Alternatively, database 83 can include a link to a broker entity which can identify them.
Interface 71 connects alarm processor 14 to a network manager entity at a higher level of the network. Alarms 73 are passed to the network manager and rule updates and topology information 72 are received from the network manager.
Referring to Figure 5, this shows a simple example state machine 220 for determining the root cause of a fault, which can be executed by the Alarm Processor of Figure 3. Figure 4 is a high-level view of two network elements 11 and 12 of the network of Figure 2, showing an operating system, a protocol stack and an application (e.g. call processing) supported by a processor at each of the elements. The state machine waits in initial state 200 for a failure to occur. As described above, the state machine has subscribed to events and alarms (triggers) to ensure that it is informed when this fault occurs. In this example the trigger is an alarm message indicating "Application A failure to communicate with Application B fault." When this trigger (or one of the other specified triggers for this state machine) is received, the state machine progresses to state 202.
At state 202 the state of protocol stack A is determined. This is compared with stored data about the expected state. If the inspected state is found to be bad, then this indicates a root cause of the problem. State machine 220 knows that there is another state machine investigating this root cause and knows that there is no need for it to raise an alarm. The original alarm received at state 200 is suppressed.
At state 204 a diagnostic test is performed on protocol stack A. If the test indicates that there is a problem with the protocol stack then this is identified as the cause of the initially received alarm at state 200. State machine 220 knows that there is another state machine investigating this root cause and knows that there is no need for it to raise an alarm. The original alarm received at state 200 is suppressed.
At state 206 a test is made of basic connectivity between element 11 and element 12 by performing a 'ping' on element 12. If the ping is successful, the state machine 220 knows that there is basic connectivity between the elements 11, 12 and proceeds to state 208. Returning to state 206, a suitable timeout limit is set for receiving a reply from element 12. If the timeout period expires without receiving a reply, the state machine 220 reasons that this is the root cause of the fault reported at state 200. An alarm identifying the root cause of a basic connectivity failure will be raised by another state machine which investigates this fault, or by the diagnostic test on connectivity.
So far, the steps have checked the status of element 11. As this fault could also involve a problem with element 12, the next steps investigate causes on element 12. At state 208 a check is made on the state of protocol stack B. If this is successful, then a diagnostic check is requested on protocol stack B. This is requested via the Alarm Processor resident on element 12. If either of these checks returns a value which indicates there is a fault, this is identified as the root cause fault. State machine 220 knows that element 12 has an Alarm Processor which will be performing similar checks and therefore suppresses the alarm which would otherwise be issued. Upon passing these checks, the state machine 220 proceeds to state 210, where a diagnostic test is requested on Application B. This test is requested via the Alarm Processor resident on element 12. If this test fails then the state machine 220 determines Application B as the root cause of the fault. State machine 220 knows that element 12 has an Alarm Processor which will be performing similar checks and therefore suppresses the alarm which would otherwise be issued.
If the checks requested on element 12 are successful then the state machine 220 could proceed to other tests. In this simple example there are no further tests to perform and so the state machine proceeds to state 211 to issue an alarm reporting the fault "Application A communication failure to Application B". It is envisaged that the rule followed by state machine 220 will be comprehensive enough to determine the root cause on most occasions. This will minimise the proportion of occasions when the state machine enters state 211.
In summary, the state machine 220 either:
(i) determines the root cause of the originally reported fault and issues an alarm to the network manager;
(ii) determines the root cause and takes no action (i.e. suppresses the alarm) when it knows that another state machine will be issuing, or has already issued, an alarm reporting the root cause; or
(iii) issues the original alarm if the root cause cannot be determined.
It can be seen that the state machine 220 attempts to identify a root cause of an initially reported alarm (or other trigger), converting the internally reported alarm into an alarm which more accurately reports the fault. This improves the quality of the reported alarms and allows an operator at the network manager to arrange remedial action for that fault and will save the operator from diagnosing that fault themselves.
In general, the state machine receives one or more triggers at the initial stage (wait for fault 200) from the host element or an element attached to the host element.
Further triggers may also cause state changes. In each state one or more actions are performed, e.g. check that an element sub-component state is X, or initiate diagnostic Y. Some actions may not receive a reply and thus timeouts are included to ensure some intermediary result can be generated, as in state 206 of Figure 5 when no response is received to a ping operation. Each state machine registers with the host element for the triggers it requires. The following events may cause the state machine to change state:
receiving a trigger from the element in the form of a reported fault (e.g. receiving an alarm at state 200);
receiving a result of a diagnostic which has been requested (e.g. the result of the diagnostic on Application B at state 210);
the expiry of a timer (e.g. the timeout in state 206);
a method invocation from a collaborating object (e.g. asking element 12 to perform a diagnostic check on protocol stack B).
When a state machine receives a trigger it may cause some or all of the following actions:
the state machine changes state;
the trigger may be a critical or a high priority alarm, in which case it is immediately forwarded to the management system via the Element Manager (20, Figure 2) so that it may be acted upon, as well as causing other actions within the state machine;
the object may obtain additional information about the element by inspecting state or initiating diagnostics;
a method may be invoked on a collaborating object, i.e. the Alarm Processor requests that a diagnostic or other check is performed on another element (e.g. the diagnostic test at states 208, 210 of Figure 5);
a timer may be started;
the root cause may have been identified, so a fault is generated.
The logic for determining the root cause has been presented as a set of state machines. These can be implemented via procedural code, look-up table or an interpreted rules language.
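As one illustration of the procedural-code option, the Figure 5 rule might be sketched as below. Collapsing the event-driven states into a sequential walk, and the check callables themselves, are simplifying assumptions; a real implementation would react to triggers and timeouts asynchronously.

```python
def figure5_rule(stack_a_ok, stack_a_diag_ok, ping_b_ok,
                 stack_b_ok, app_b_diag_ok):
    """Returns the alarm text to issue externally, or None when the
    alarm is suppressed because another state machine owns the
    discovered root cause. State 200 (wait for fault) is assumed to
    have already fired before this function is called."""
    # State 202: inspect local protocol stack A against expected state.
    if not stack_a_ok():
        return None  # another state machine investigates this root cause
    # State 204: run a diagnostic on protocol stack A.
    if not stack_a_diag_ok():
        return None  # likewise suppressed
    # State 206: basic connectivity check ('ping' with timeout).
    if not ping_b_ok():
        return None  # the connectivity state machine raises the alarm
    # States 208 and 210: remote checks via element 12's Alarm Processor.
    if not stack_b_ok() or not app_b_diag_ok():
        return None  # element 12's Alarm Processor raises the alarm
    # State 211: no root cause found, so issue the original alarm.
    return "Application A communication failure to Application B"

ok = lambda: True
print(figure5_rule(ok, ok, ok, ok, ok))  # all checks pass: original alarm issued
```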
As noted above, there are multiple rules, each being implemented as a state machine. Multiple state machines operate in parallel with each other, each looking for the root cause to a particular trigger, or set of triggers. The same trigger can be fed to multiple state machines. Several state machines may determine the same root cause, although ideally only one state machine should be permitted to issue an alarm externally to the element for a given root cause. In the example of Figure 5, the initial "Application A failure to communicate with Application B" alarm message can be diagnosed as a basic connectivity failure to element 12 at 207. Other state machines, which are responsive to alarms such as "buffer overflow in interface to element 12", will also diagnose this same root cause. The set of state machines is implemented in a manner which ensures that the minimum number of external alarms is issued for any fault condition. Ideally, only one alarm is issued which represents the root cause of the fault.
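A minimal sketch of one way to enforce this single-alarm-per-root-cause behaviour follows. The explicit ownership map is an assumption about how the arrangement might be realised; the text does not prescribe a mechanism.

```python
class TriggerDispatcher:
    """Fans each trigger out to every interested state machine, but
    lets only the designated 'owner' of a root cause issue an alarm."""

    def __init__(self):
        self._machines = []  # (trigger_set, machine) pairs
        self._owner = {}     # root cause -> owning machine

    def register(self, machine, triggers, owns=()):
        self._machines.append((set(triggers), machine))
        for cause in owns:
            self._owner[cause] = machine

    def deliver(self, trigger):
        external_alarms = []
        for triggers, machine in self._machines:
            if trigger in triggers:
                cause = machine(trigger)  # returns a root cause or None
                if cause and self._owner.get(cause) is machine:
                    external_alarms.append(cause)  # only the owner reports it
        return external_alarms

d = TriggerDispatcher()
diagnose_connectivity = lambda t: "connectivity failure to element 12"
diagnose_comm = lambda t: "connectivity failure to element 12"  # same diagnosis
d.register(diagnose_connectivity, {"ping timeout"},
           owns=("connectivity failure to element 12",))
d.register(diagnose_comm, {"app A to app B comm failure"})
print(d.deliver("app A to app B comm failure"))  # [] -- suppressed, not the owner
print(d.deliver("ping timeout"))  # ['connectivity failure to element 12']
```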
Figure 6 shows a network management entity 350 which forms part of the network of Figure 2. Entity 350 can be co-located with the expert system 32 which conventionally logs and analyses alarms within the network, or it can be located elsewhere within the network. Entity 350 selects 320 and packages 325 rules for sending 330 to network elements. The rules allow the elements to determine, for themselves, each possible root cause that can arise at that element. The rules can be generated by an automated system (e.g. the existing system 32), with rule selection function 320 selecting existing rules used by the expert system 32 that are applicable to each network element. Other rules can be written by a human operator based on observed patterns of behaviour and expert knowledge of a fault. Each rule is packaged 325 in the form of an event-driven state machine of the type just described. For each root cause fault rule, the network elements giving rise to that root cause are identified, e.g. amplifiers of type X. A database 315 of element data is used to make this selection. A copy (instance) of that rule is distributed to the required network elements and installed in each of the alarm processors 14. Each alarm processor 14 stores a set of rules 82 appropriate to that element. For each fault identified above, an instance of the object is instantiated at each element capable of raising the fault. Topology information can be included within the rules.
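A minimal sketch of the select (320), package (325) and send (330) steps follows. The PackagedRule fields, the example fault name and the send stub are assumptions for illustration only.

```python
from dataclasses import dataclass

@dataclass
class PackagedRule:
    name: str          # the root-cause fault the rule looks for (illustrative)
    element_type: str  # e.g. "amplifier type X"
    body: str          # serialised state machine or rules-language text

def distribute(rules, element_db, send) -> None:
    """Select (320) the elements each rule applies to, using the
    element database 315, and send (330) an instance of the rule to
    each of them (message 72 over interface 71 in Figure 3)."""
    for rule in rules:
        for element_id, element_type in element_db.items():
            if element_type == rule.element_type:
                send(element_id, rule)

distribute(
    [PackagedRule("example-root-cause", "amplifier type X", "...")],
    {"elem-11": "amplifier type X", "elem-12": "router"},
    send=lambda eid, rule: print(f"installing '{rule.name}' on {eid}"),
)
```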
Rule semantics should either be standardised, or the rules generation engine 350 must be vendor aware and generate rules in the particular format of the vendor of an Alarm Processor. Inter-Alarm Processor messaging can either be standardised or the connection details for the interface (61, Figure 3) should include the vendor specific considerations.
The invention is not limited to the embodiments described herein, which may be modified or varied without departing from the scope of the invention.

Claims (19)

  1. A management function for a network element of a communications network comprising control logic which is operable to: receive at least one trigger from the element which is indicative of a fault in the operation of the element; determine the root cause of the fault within the element; and, issue an alarm externally to the element which identifies the root cause of the fault, and wherein the control logic is configurable by a network management entity external to the element.
  2. A management function according to claim 1 wherein the trigger is an alarm generated within the network element.
  3. A management function according to claim 2 wherein the control logic is operable to receive a plurality of alarms from the element which are related to a common root cause.
  4. A management function according to claim 3 wherein the control logic is operable to suppress at least some of the alarms.
  5. A management function according to claim 3 wherein the control logic is operable to issue only a single alarm external to the element.
  6. A management function according to claim 1 wherein the control logic initiates a diagnostic operation on the network element or another network element connected to the network element.
  7. A management function according to claim 1 wherein the control logic determines a state of an application running on the element.
  8. A management function according to claim 1 wherein the management function is operable to register with an application at the network element which will generate triggers indicative of a fault.
  9. A management function according to claim 1 wherein the management function comprises a store of connection details for collaborating objects that the network element is connected to.
  10. A management function according to claim 1 wherein the control logic is in the form of a state machine.
  11. A management function according to claim 10 wherein there is a plurality of state machines operable in parallel with one another, each state machine differing in the trigger, or set of triggers, that it receives.
  12. A management function according to claim 11 wherein the plurality of state machines are arranged such that if a first state machine determines a root cause for which a second state machine will issue an alarm externally to the element, the first state machine does not issue an alarm externally to the element.
  13. A management function according to claim 11 wherein only one state machine of the plurality of state machines is permitted to issue an alarm external to the element for each root cause.
  14. A network element comprising a management function according to claim 1.
  15. A communications network comprising a network element according to claim 14 and a network management entity which is operable to configure the control logic at the management function of the network element.
  16. A method of raising an alarm at a network element of a communications network comprising, at the network element: receiving a trigger from the element which is indicative of a fault in the operation of the element; determining the root cause of the fault within the element using control logic which is configurable by a network management entity external to the element; and, issuing an alarm externally to the element which identifies the root cause of the fault.
  17. A method of operating a network management entity of a communications network comprising a plurality of network elements which each implement an alarm processing function, the method comprising: selecting a rule to determine the root cause of a fault within one of the elements; and sending the rule to the network element for execution locally by the alarm processing function within the network element.
  18. A signal for transmission across a communications network comprising a plurality of network elements which each implement an alarm processing function, the signal carrying a rule to determine the root cause of a fault within one of the elements.
  19. A computer program product comprising a machine-readable medium carrying instructions for providing an alarm processing function at a network element of a communications network, the instructions causing the network element to: receive a trigger from the element which is indicative of a fault in the operation of the element; follow control logic to determine the root cause of the fault within the element, the control logic being configurable by a network management entity external to the element; and, issue an alarm which identifies the root cause of the fault.
GB0428189A 2004-12-23 2004-12-23 Distributed network fault analysis Withdrawn GB2421656A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB0428189A GB2421656A (en) 2004-12-23 2004-12-23 Distributed network fault analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB0428189A GB2421656A (en) 2004-12-23 2004-12-23 Distributed network fault analysis

Publications (2)

Publication Number Publication Date
GB0428189D0 GB0428189D0 (en) 2005-01-26
GB2421656A true GB2421656A (en) 2006-06-28

Family

ID=34113135

Family Applications (1)

Application Number Title Priority Date Filing Date
GB0428189A Withdrawn GB2421656A (en) 2004-12-23 2004-12-23 Distributed network fault analysis

Country Status (1)

Country Link
GB (1) GB2421656A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5864662A (en) * 1996-06-28 1999-01-26 Mci Communication Corporation System and method for reported root cause analysis
GB2318479A (en) * 1996-10-21 1998-04-22 Northern Telecom Ltd Telecomm alarm correlation
WO2000025527A2 (en) * 1998-10-28 2000-05-04 Telefonaktiebolaget Lm Ericsson (Publ) Alarm correlation in a large communications network
US20040223461A1 (en) * 2000-06-16 2004-11-11 Ciena Corporation. Method and apparatus for aggregating alarms and faults of a communications network
EP1172967A2 (en) * 2000-07-14 2002-01-16 Nortel Networks Limited Method and device for generic fault management

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1968234B1 (en) * 2007-02-20 2011-04-06 Siemens Aktiengesellschaft Operating a communications network
US8396818B2 (en) 2007-02-20 2013-03-12 Siemens Aktiengesellschaft Operating a communications network
WO2016026510A1 (en) * 2014-08-18 2016-02-25 Telefonaktiebolaget L M Ericsson (Publ) Hardware fault identification management in a network
EP3570494A3 (en) * 2018-05-17 2019-12-25 Accenture Global Solutions Limited A framework for intelligent automated operations for network, service & customer experience management

Also Published As

Publication number Publication date
GB0428189D0 (en) 2005-01-26

Similar Documents

Publication Publication Date Title
EP3327637B1 (en) On-demand fault reduction framework
US6317788B1 (en) Robot policies for monitoring availability and response of network performance as seen from user perspective
US9413597B2 (en) Method and system for providing aggregated network alarms
US7426654B2 (en) Method and system for providing customer controlled notifications in a managed network services system
US7281040B1 (en) Diagnostic/remote monitoring by email
US8738760B2 (en) Method and system for providing automated data retrieval in support of fault isolation in a managed services network
US8812649B2 (en) Method and system for processing fault alarms and trouble tickets in a managed network services system
EP2284757A1 (en) Security vulnerability information aggregation
US20060233312A1 (en) Method and system for providing automated fault isolation in a managed services network
US20120005538A1 (en) Dynamic Discovery Algorithm
KR20010072379A (en) Fault tolerant computer system
US20210105179A1 (en) Fault management method and related apparatus
CN113328885B (en) Network health degree evaluation method, device, electronic equipment, medium and program product
US7933211B2 (en) Method and system for providing prioritized failure announcements
CN112764956A (en) Database exception handling system, and database exception handling method and device
US7010795B2 (en) Process for sending a notification in a data processing network with distributed applications
GB2421656A (en) Distributed network fault analysis
WO2019241199A1 (en) System and method for predictive maintenance of networked devices
WO2000065448A1 (en) A method and system for handling errors in a distributed computer system
CN112291302A (en) Internet of things equipment behavior data analysis method and processing system
CN117041172B (en) White box switch interface request processing method and device
CN114513398B (en) Network equipment alarm processing method, device, equipment and storage medium
US11704188B2 (en) Apparatuses, computer-implemented methods, and computer program products for improved data event root cause identification and remediation
CN115150253B (en) Fault root cause determining method and device and electronic equipment
KR20020065188A (en) Method for managing fault in computer system

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)