WO2017050621A1 - Fault diagnosis - Google Patents

Fault diagnosis

Info

Publication number
WO2017050621A1
WO2017050621A1 (PCT/EP2016/071722)
Authority
WO
WIPO (PCT)
Prior art keywords
fault
network
network condition
reports
database
Prior art date
Application number
PCT/EP2016/071722
Other languages
French (fr)
Inventor
Gerard POWIS
Original Assignee
British Telecommunications Public Limited Company
Priority date
Filing date
Publication date
Application filed by British Telecommunications Public Limited Company
Publication of WO2017050621A1

Classifications

    • H04L 41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L 41/06 Management of faults, events, alarms or notifications
    • H04L 41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H04L 41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H04L 41/0677 Localisation of faults
    • H04L 41/5074 Handling of user complaints or trouble tickets
    • H04L 43/12 Network monitoring probes
    • H04W 24/00 Supervisory, monitoring or testing arrangements

Abstract

Fault conditions in a distributed system (2) are diagnosed by correlation of routine network condition reports (50), collected by the network and stored in a database (55), with reports of faults (60) identified at remote termination points, typically by end-users, which are recorded in a fault logging system (6). Correlation of fault or condition reports, by reference to the topology (8) of the system, allows better identification (7) of the cause of a fault, and allows co-ordination of routine maintenance and inspection tasks (91) with ad hoc task management (90) to attend to faults recorded by the fault logging system.

Description

FAULT DIAGNOSIS
This invention relates to diagnosis of faults in networks and other distributed systems. It can often be very difficult to diagnose the root causes of reported faults in such a system as they may be manifested as several individual fault reports made at different locations which may be remote from each other and from the actual cause of the problem. Current fault identification methods rely heavily on previous fault history, but it takes time to identify, plan and carry out the work required.
For example, in a telecommunications network, the attempt to predict end-user faults using individual tests across tens of millions of lines, and to interpret those results, can trigger many false correlations. (For example, two fault reports from terminations connected to the same distribution point may be the result of a fault at the distribution point itself, but it may also be the result of two quite unrelated problems on the two individual lines).
Typical repair processes are classified as either reactive - an end-user enquiry prompts the network operator to run a line test, or to dispatch a technician to repair the fault - or proactive - an existing fault history (closed faults and results of routine line tests) is analysed and a technician sent to survey the network to confirm the need for preventative maintenance to prevent further problems or arrest a deteriorating trend, and then carry out the required preventative maintenance. Reactive processes are generally handled on an individual basis because they relate to failures that have already happened and service needs to be restored as soon as possible, whereas proactive processes can be scheduled in such a way as to make the most efficient use of technicians and resources, as they are dealing with gradual deterioration and potential future failures, which are not as time-critical.
Current fault volume reduction and performance enhancement processes rely heavily on human interpretation of the data.
The present invention provides apparatus for diagnosing fault conditions in a distributed system, having a network condition monitoring system for receiving measurement data from a plurality of reporting points in the distributed system, a network condition database, and a performance testing system for retrieving performance data from the network condition monitoring system to be stored in the network condition database, further comprising a fault logging system for reporting faults relating to a plurality of termination points connected to, but remote from, the reporting points, and a correlation system for comparing faults reported by the fault logging system with data stored in the network condition database to identify elements in the distributed system whose condition can be associated with one or more of the fault reports.
The invention also provides a method for diagnosing fault conditions in a distributed system, wherein network condition reports are collected in a network condition database by a network condition monitor taking measurements at a plurality of respective reporting points, and reports of faults relating to a plurality of termination points connected to, but remote from, the reporting points are recorded in a fault logging system, wherein faults recorded by the fault logging system are correlated with data stored in the network condition database to identify elements in the distributed system whose condition can be associated with the fault reports.
Thus a process is provided that links reported faults to root cause network issues. A link is formed between the network components shared by the terminations, thus facilitating the repair of end-user faults and the root cause in one visit. Where required it expedites fault volume reduction activity where specialist stores and expertise are required.
The invention therefore provides advantages by joining the activities and underlying data sources together. This gives rise to efficiencies across the programme and its technical resources.
The apparatus may further comprise a task management system for generating instructions to attend to the faults identified by the fault logging system, the task management system being further arranged so that performance-related tasks to be attended to together with the fault rectification tasks can be identified from the network condition database.
The relatedness of fault reports and network conditions, and the identification of related tasks, are identified by reference to a database of relationships between individual elements of the distributed system. The apparatus may further comprise a test probing system for periodically interrogating the reporting points for network condition reports.
In the case of proactive repair, a network can be defined as a series of end connection points e.g. PCP (Cabinet or Primary Cross-Connection Point) and DPs (Distribution Points), which can be used as demarcations to target investment and repair. Typically, the network topology is recorded as a series of cable sections, each running from one structure or connection point to another. The database is typically derived electronically as part of the planning process prior to installation or, for older installations, by digitisation of existing network drawings, and as well as the end points may record other data such as grade and composition of the cable, whether it is aerial or underground, etc. The network can be resolved down to groups of 100 lines (end-users). In reality each such group is itself an aggregation of numerous joint closures and cables (e.g. one representation could be five sets of 20-pair cables). This means a technician has to locate which part of the 5 x 20 = 100 lines to focus upon when identifying faults and conducting repairs. The present invention allows issues to be identified down to individual cable level. This makes for a more reliable diagnosis of issues and reduces the time taken to investigate.
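To make the cable-section model concrete, the following sketch shows one plausible way such topology records could be represented and queried; the class fields, the UPSTREAM map and the upstream_sections() helper are illustrative assumptions rather than the patent's actual schema.
```python
# Minimal sketch of a cable-section topology record and an upstream lookup,
# assuming a simple tree of connection points; all names and data are
# illustrative, not the patent's database schema.
from dataclasses import dataclass

@dataclass
class CableSection:
    section_id: str
    from_point: str      # e.g. a PCP (cabinet) or joint closure
    to_point: str        # e.g. a DP (distribution point) or a termination
    pairs: int           # cable size, e.g. a 20-pair cable
    aerial: bool         # aerial or underground
    grade: str           # cable grade / composition

# Upstream connectivity: each point maps to (parent point, section reaching it).
UPSTREAM = {
    "DP_201":    ("PCP_20", CableSection("S1", "PCP_20", "DP_201", 20, False, "0.5 mm Cu")),
    "DP_202":    ("PCP_20", CableSection("S2", "PCP_20", "DP_202", 20, True,  "0.5 mm Cu")),
    "term_2011": ("DP_201", CableSection("D1", "DP_201", "term_2011", 1, True, "drop wire")),
    "term_2013": ("DP_201", CableSection("D2", "DP_201", "term_2013", 1, True, "drop wire")),
    "term_2021": ("DP_202", CableSection("D3", "DP_202", "term_2021", 1, True, "drop wire")),
}

def upstream_sections(termination: str) -> list[CableSection]:
    """Resolve a termination to the ordered list of cable sections serving it."""
    path, point = [], termination
    while point in UPSTREAM:
        parent, section = UPSTREAM[point]
        path.append(section)
        point = parent
    return path

print([s.section_id for s in upstream_sections("term_2011")])  # ['D1', 'S1']
```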
These existing proactive repair systems rely on the manual analysis and interpretation of historic fault and test and diagnostic data to identify the parts of the network that are at the highest risk of failure. Once identified, work to conduct proactive uplift has to be planned, resourced and allocated. The present invention automates the review of these data sets but also adds real-time end-user fault information and field-based observations to provide a better means of prioritising the work required.
Conversely, existing reactive repair systems focus on the single end-user- reported fault and the supplying circuit. The present invention consolidates information relating to the surrounding network and this allows the technician to make a more informed judgement on what is the likely root cause to deliver a more holistic repair.
In short, the invention provides a symbiotic means of delivering responses to end-user fault reports and a proactive maintenance programme. This considered, the operator can wait until an end-user-reported fault occurs in the same location as a suspect network component. This serves to increase the likelihood of predicting the point of failure with the maximum return on investment when resolved. This in turn allows the identification of the early signs of potential failure in the network but with enough confidence to allow time to carry out a lasting repair.
The technician is provided with multiple test results from the proactive testing regime, previous fault history information, and recorded defects, displayed in a graphical user interface with a map. This provides the technician with a much richer picture of network health to support the identification of a root cause and permanent solution for the end-user.
An embodiment of the invention will now be described by way of example, with reference to the Figures, in which:
Figure 1 is a schematic representation of a simplified communications network incorporating a fault diagnosis system operating according to an embodiment of the invention
Figures 2 and 3 illustrate an example of a first situation in which fault diagnosis can be facilitated by use of this embodiment of the invention
Figures 4 and 5 illustrate an example of a second situation in which fault diagnosis can be facilitated by use of this embodiment of the invention
Figure 6 is a schematic representation of a fault identification process operating according to the invention
Figure 7 is a schematic overview of a task management process making use of the fault identification process of Figure 6
In this embodiment the necessary operations and systems are embodied in software running on one or more server platforms. The embodiment brings together reactive and proactive activities and systems to deliver a more efficient maintenance process.
Essentially it involves "drilling down" fault reports to correlate them to the cable sections serving the termination reporting the fault, looking for correlations between fault reports which relate to terminations that share a section "upstream", and correlating such user-end reports with routine network test data, which are generally measured from the network end, to identify elements of the network that are deteriorating, again using the network topology as the basis for correlation.
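As a rough illustration of this correlation step, the sketch below groups live fault reports by the upstream elements their terminations share; the termination-to-path lookup is assumed data, and in practice it would come from the network inventory described earlier.
```python
# Sketch: count how often each upstream element appears in the paths of the
# terminations that have reported faults. Shared elements with high counts are
# candidate common causes. The path data here is an illustrative assumption.
from collections import Counter

UPSTREAM_PATH = {            # termination -> elements between it and the exchange
    "2011": ["DP_201", "PCP_20"],
    "2013": ["DP_201", "PCP_20"],
    "2021": ["DP_202", "PCP_20"],
    "2022": ["DP_202", "PCP_20"],
}

def shared_upstream(fault_reports: list[str]) -> Counter:
    counts: Counter = Counter()
    for termination in fault_reports:
        counts.update(UPSTREAM_PATH.get(termination, []))
    return counts

# Reports from 2011, 2013 and 2021: PCP_20 serves all three, DP_201 serves two.
print(shared_upstream(["2011", "2013", "2021"]).most_common())
```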
Live fault reports are matched to a prioritised list of existing network problems (faults, defects and suspect lines from overnight routine testing), identifying lines that are suspect using an algorithm that uses conditions and a scoring method to grade electrical health. The technician then receives, in addition to the actual end-user fault report, either a fault volume reduction (FVR) task or, if the problem is not yet identified, a survey task, to be carried out alongside the rectification of the prioritised end-user-reported fault. This allows suspect lines to be prioritised according to their effect on the service that end-users are experiencing.
As depicted in Figure 1, a network, generally indicated at 1, comprises a number of primary nodes 2, 3, 4, each connected, directly or indirectly, to a number of respective secondary nodes 20, 201, 202, 21, 211, 23, 24; 30, 31; 40, 401, 41, 411, and thus to a number of user terminations 2011, 2012, 2013, 2021, 2022, 2101, 2111, 2112, 3001, 3002, 3003, 3101, 3102, 4011, 4012, 4013, 4101, 4102, 4111, 4112.
A network condition monitoring system 5 is arranged to receive fault reports logged by the primary nodes 2, 3, 4, either in response to condition report signals triggered by predetermined fault conditions, or in response to probe signals transmitted from the network condition monitoring system, and store the results of such tests in a database 55.
In addition, a user fault report logging system (UFRL) 6 records faults associated with individual network terminations e.g. 2011, 2012, ... 4112. Such reports may be logged automatically, by fault logging applications in the individual terminations e.g. 3001, by transmitting reports through the network 1 and a connection 600. However, many faults are of a type which are either not detectable by the equipment itself, or result in disconnection from the network 1, and so in most cases fault reports are generated manually through a user interface 60 in response to a user identifying a fault and reporting it to a human-operated helpdesk or by logging the fault using another communications system, for example reporting a fault on a landline by using a wireless internet connection to connect to the fault logging database 6.
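A minimal fault-report record for such a logging system might look like the sketch below; the field names and the "automatic"/"manual" source values are assumptions used only to illustrate the two reporting routes just described.
```python
# Illustrative fault-report record for the user fault report logging (UFRL)
# system; field names and source values are assumptions, not the patent's
# data model.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class UserFaultReport:
    termination: str       # e.g. "3001"
    description: str       # e.g. "no dial tone"
    source: str            # "automatic" (terminal self-report) or "manual" (helpdesk / web)
    reported_at: datetime

reports = [
    UserFaultReport("3001", "no dial tone", "automatic", datetime.now(timezone.utc)),
    UserFaultReport("2011", "noisy line", "manual", datetime.now(timezone.utc)),
]
print(sum(r.source == "manual" for r in reports), "manual report(s) logged")
```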
In prior art systems, problems reported at the network end 5 are handled in a different manner from faults reported at the user end 6. In general, network conditions are monitored routinely and trends are determined, to build up a database 55 of the general condition of the network, so that intervention can be made pro-actively before a deteriorating condition becomes bad enough to impair service to customers. Some faults may be noticeable to customers as a gradual deterioration in service quality, but others may not manifest themselves to the customer at all, provided a backup system is available - unless, and until, the backup system fails too. Having a network-wide view of conditions allows for efficient scheduling of maintenance tasks, by scheduling tasks according to priorities such as the rate of deterioration, technician availability, and manpower deployment, for example scheduling a low-priority task to be done at the same time as a higher-priority task at the same location.
On the other hand, all user-reported faults are treated as high priority because they have, by definition, already had an effect on the end-user's ability to use the service. Such a fault reporting and management system is necessarily reactive. It will be noted in particular that diagnosis is made more complicated because only positive reports of faults are received from the user end - absence of a fault report from a given user, e.g. 4101, does not necessarily indicate that there is no fault on the node 41 connecting that user to the network. The user may not have tried to use the connection recently, or not to the extent necessary for the fault to be apparent. In particular, the fault may be intermittent, or only evident when several users are trying to connect through the same intermediate nodes. Thus, referring to Figure 1, the presence of fault reports from terminals 4102, 4111 and 4112 may indicate a fault at one of the nodes 4, 41 serving all three terminals, or it may indicate two separate faults: one at the node 411 serving both nodes 4111 and 4112, and a second at node 4102 itself. Nothing can be deduced from the absence of a fault report from the node 4101.
Figures 2, 3, 4 and 5 illustrate how the network condition database 55 and user fault logging system 6 both have incomplete views of the condition of the network and in particular the location of a fault. These Figures all depict a part of the network 1, namely that part connected to one of the secondary nodes 20. The network condition monitoring system 5 monitors this part of the network through the respective primary node 2. Any faults reported by the users are logged in the user fault reporting system 6. In each of these examples, a fault is present on one or more of the secondary nodes 20, 201, 202, but neither the network-end database 55 nor the user-facing end 6 has a sufficiently complete picture to determine exactly which node or nodes is or are the problem. In these figures, solid triangles represent end-user terminals in respect of which faults have been reported to the UFRL 6, and solid circles represent the node or nodes that require attention.
As seen in Figure 2, end-users 2011, 2013 and 2021 all report a connection fault to the UFRL 6. As their first common point of connection to the network is the node 20, this pattern of faults suggests a fault at that node, and a reactive response system would initiate a site visit to investigate. Note that, as the fault reporting system at the customer end is reactive, the system can only deduce the presence of faults from positive reports - the absence of fault reports from other end-users 2012, 2022, connected through the node 20 is not evidence that the faults are more localised: the respective users may not have observed the fault, for example because they are not using a service on which the fault is evident.
However, as seen in Figure 3, the same pattern of user fault reports (2011, 2013, 2021) may be caused by two separate faults deeper into the network (closer to the end users), in this example at nodes 201 and 202. In this case, a site visit to the common point of connection at 20 would be a waste of time and resources, as the faults are elsewhere.
Such ambiguities can be resolved, in this embodiment, by matching the fault reports logged at the UFRL 6 in respect of the user terminations 2011, 2013, 2021 with the view of the network as seen from the network condition database 55. In the situation depicted in Figure 2, the network condition database 55 has, as a result of routine probes of the network, discovered suboptimal performance at the node 20, which confirms that the faults reported at 2011, 2013, 2021 have a common cause. Conversely, in the situation depicted in Figure 3, no fault is identified at the common node 20, indicating that the faults reported to the UFRL 6 do not have a common cause. As potential faults have been logged at both nodes 201 and 202, it is those which should be attended to.
Comparing now Figures 4 and 5, the network condition database 55 has transmitted a probe message down the connection 222 to the node 20 serving the part of the network under discussion, from which a negative or errored response 50 (Figures 1, 6) has been received, but the nature of the fault is such that its precise location cannot be determined from the network end. By matching the error to customer reports it can be possible to identify the affected node more precisely. In Figure 5, fault reports are only arriving at the UFRL 6 from user terminals connected to one of the secondary nodes - node 201 - indicating that the fault is likely to be in that node, whereas in Figure 4 fault reports are arriving from terminations connected to more than one of the secondary nodes, making it more likely that the fault is in the primary node 20.
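The reasoning behind Figures 2 to 5 can be summarised as a small decision rule: an element common to all reporting terminations is only treated as the probable root cause if the network-end condition data also shows it as degraded; otherwise attention falls to degraded elements on the individual branches. The sketch below illustrates that rule with assumed data and function names.
```python
# Sketch of the Figure 2-5 disambiguation: a shared upstream element is only
# treated as the probable root cause if (a) every reporting termination uses it
# and (b) the network-end condition data flags it as degraded.
# All names and data are illustrative assumptions, not the patent's interfaces.

UPSTREAM_PATH = {
    "2011": ["DP_201", "PCP_20"],
    "2013": ["DP_201", "PCP_20"],
    "2021": ["DP_202", "PCP_20"],
}

def probable_root_causes(reports: list[str], degraded: set[str]) -> list[str]:
    """Return degraded elements shared by every reporting termination, falling
    back to degraded elements on individual branches if no common cause fits."""
    paths = [set(UPSTREAM_PATH[t]) for t in reports]
    common = set.intersection(*paths) if paths else set()
    confirmed_common = sorted(common & degraded)
    if confirmed_common:
        return confirmed_common                           # Figure 2 / Figure 4 case
    # Figure 3 / Figure 5 case: degraded elements on individual branches.
    return sorted(set().union(*paths) & degraded)

print(probable_root_causes(["2011", "2013", "2021"], degraded={"PCP_20"}))              # ['PCP_20']
print(probable_root_causes(["2011", "2013", "2021"], degraded={"DP_201", "DP_202"}))    # ['DP_201', 'DP_202']
```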
It will be appreciated that the examples given here are very much simpler than real-life situations, which may involve much more complex network topologies such as rings, duplicate paths, etc.
As depicted in Figure 1 , this embodiment operates using a pattern matching system 7 to compare fault reports logged by the user fault report logging system 6 with the current condition status of the various network elements stored in the network condition database 55, by reference to a network model 8. The pattern matching system provides an output 90 to a task management system 9 for allocating tasks to maintenance and repair personnel.
Figure 6 is a schematic diagram illustrating the various information flows taking place in the embodiment. Figure 7 is a schematic diagram illustrating the task allocation process using the fault identification process of Figure 6. The process consists of the following steps:
Routine tests 50 carried out on the network 2 are used by the network condition monitor 5 to identify suspect lines, which are reported (step 51) to the network condition database 55. These suspect lines are identified (step 5) using an algorithm that uses conditions and a scoring method to grade each line's electrical health. When a live fault report 60 is delivered by the user fault report logger system 6, it is matched (77) to a prioritised list of existing network problems (faults, defects and suspect lines from overnight routine testing) stored in the network condition database 55. By combining data from overnight tests, fault volume reduction and network inventory systems it is possible to identify the cable sections deemed at a high risk of failure. The system then matches the cable sections to the first in-bound fault along the same route.
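The scoring algorithm itself is not specified in detail, so the following is only a plausible sketch of how line conditions from overnight testing might be weighted into an electrical-health score and a suspect-line list; the condition names, weights and threshold are assumptions.
```python
# Hypothetical health-scoring sketch: weight observed test conditions and flag
# suspect lines. The condition names, weights and threshold are assumptions;
# the patent only states that conditions and a scoring method are used.
CONDITION_WEIGHTS = {
    "high_resistance_joint": 40,
    "low_insulation_resistance": 30,
    "battery_contact": 25,
    "capacitive_imbalance": 15,
}
SUSPECT_THRESHOLD = 50

def health_score(conditions: list[str]) -> int:
    """Sum the penalty weights of the conditions observed on a line."""
    return sum(CONDITION_WEIGHTS.get(c, 0) for c in conditions)

overnight_results = {
    "line_2011": ["low_insulation_resistance", "capacitive_imbalance"],
    "line_2012": [],
    "line_2013": ["high_resistance_joint", "battery_contact"],
}

suspect_lines = {line: health_score(conds)
                 for line, conds in overnight_results.items()
                 if health_score(conds) >= SUSPECT_THRESHOLD}
print(suspect_lines)  # e.g. {'line_2013': 65}
```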
The potentially relevant parts of the network 2 are identified from the network model 8 (step 80) and the fault report 60 is compared with data 52 from the network condition database 55 relating to those parts of the network 2 to identify the most likely location of the cause of the faults reported by the users.
These faults are then prioritised (71), for example by severity, number of end-users affected, availability of backup capability, or safety-critical functions, and a list of fault rectification tasks is generated (72) which are reported (70) to a task management system 9. The task management system 9 co-ordinates data relating to equipment which requires investigation (92), reported faults (93) and defects identified from network tests (94) to generate instructions 90 to rectify the faults, along with any related network management tasks 91 which may conveniently be done at the same time. Such management tasks 91 would typically be investigation of impaired performance which has not yet caused any customer fault reports, routine maintenance, etc. These instructions are then transmitted to the technical staff through an interface 95. The technicians, having carried out the repair work specified in the fault rectification instruction 90, can perform any related survey tasks 91 and, if necessary, report on the network condition at that location and identify any further work that may be necessary (96), which is used to update the network condition database 55.
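One way the prioritisation and bundling described above could be realised is sketched below: each rectification task is scored by its impact, and any proactive work recorded against the same location is attached so it can be carried out on the same visit. The weights and record fields are assumptions for illustration only.
```python
# Sketch: prioritise fault rectification tasks and bundle co-located proactive
# work (surveys, fault volume reduction) so it can be done on the same visit.
# Priority weights and record fields are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class FaultTask:
    task_id: str
    location: str                     # e.g. a PCP or DP identifier
    users_affected: int
    safety_critical: bool = False
    backup_available: bool = True
    related_tasks: list[str] = field(default_factory=list)

    def priority(self) -> int:
        score = self.users_affected
        if self.safety_critical:
            score += 100              # safety-critical functions come first
        if not self.backup_available:
            score += 20               # lack of backup capability raises urgency
        return score

# Proactive work already identified against network locations.
proactive_work = {"PCP_20": ["survey joint closure", "replace degraded 20-pair cable"]}

tasks = [FaultTask("T1", "PCP_20", users_affected=3),
         FaultTask("T2", "DP_411", users_affected=1, backup_available=False)]

for task in sorted(tasks, key=lambda t: t.priority(), reverse=True):
    task.related_tasks = proactive_work.get(task.location, [])
    print(task.task_id, task.priority(), task.related_tasks)
```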
This embodiment can therefore be used to allow technicians to identify and undertake fault volume reduction work whilst attending a live fault.
A technician then receives either a fault volume reduction (FVR) task (90) or a survey task (91). The technician will fix the end-user fault as a priority and then carry out the FVR or survey task. The integration of user fault report logging and network condition data allows information to be presented to the technician so that the technician can repair both the immediate fault and the root cause of failure in the network in a single visit. The technician can therefore be presented with all of the test, network inventory and defect information to fix the fault (reactive part). The technician can survey and address the underlying root-cause and, where appropriate, proactively carry out any required fault volume reduction work.
The embodiment has been described in relation to a telecommunications network, but the principle may be applied in other utility-based industries where inventory data and historical data can be used, for example leak detection and repair in oil, gas and water applications, or in a highway maintenance regime for potholes, by comparing real-time monitoring feeds with in-bound defect reports and then conducting resurfacing activities.

Claims

1. Apparatus for diagnosing fault conditions in a distributed system having a network condition monitoring system for receiving measurement data from a plurality of reporting points in the distributed system, a network condition database, and a performance testing system for retrieving performance data from the network condition monitor to be stored in the network condition database, further comprising a fault logging system for reporting faults relating to a plurality of termination points connected to, but remote from, the reporting points, and a correlation system for comparing faults reported by the fault logging system with data stored in the network condition database to identify elements in the distributed system whose condition can be associated with one or more of the fault reports.
2. Apparatus according to Claim 1, further comprising a task management system for generating instructions to attend to a reported fault identified by the fault logging system, the task management system being further arranged to identify from the network condition database any related performance-related tasks to be attended to together with the reported fault.
3. Apparatus according to Claim 1 or Claim 2, further comprising a database of relationships between individual elements of the distributed system, wherein the correlation system identifies the relatedness of fault reports and network conditions, and the identification of related tasks, by reference to the database.
4. Apparatus according to Claim 3, wherein the database stores a record of the topology of the distributed system, and of connections between the individual elements.
5. Apparatus according to Claim 1, Claim 2, Claim 3 or Claim 4, further comprising a test probing system for periodically interrogating the reporting points for network condition reports.
6. A method for diagnosing fault conditions in a distributed system, wherein network condition reports are collected in a network condition database by a network condition monitor taking measurements at a plurality of respective reporting points, and reports of faults relating to a plurality of termination points connected to, but remote from, the reporting points are recorded in a fault logging system, wherein faults recorded by the fault logging system are correlated with data stored in the network condition database to identify elements in the distributed system whose condition can be associated with the fault reports.
7. A method according to Claim 6, wherein instructions to attend to the faults identified by the fault logging system are generated, and wherein performance-related tasks to be attended to together with rectification of the reported faults are identified from the network condition database.
8. A method according to Claim 6 or Claim 7, wherein the relatedness of fault reports and network conditions, and the identification of related tasks, are identified by reference to a database of relationships between individual elements of the distributed system.
9. A method according to Claim 8, wherein the database stores a record of the topology of the distributed system, and of connections between the individual elements.
10. A method according to Claim 6, Claim 7, Claim 8 or Claim 9, wherein the reporting points are periodically interrogated for network condition reports.
PCT/EP2016/071722 2015-09-25 2016-09-14 Fault diagnosis WO2017050621A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB1517018.6 2015-09-25
GB1517018.6A GB2542610B (en) 2015-09-25 2015-09-25 Fault diagnosis

Publications (1)

Publication Number Publication Date
WO2017050621A1 true WO2017050621A1 (en) 2017-03-30

Family

ID=54544133

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2016/071722 WO2017050621A1 (en) 2015-09-25 2016-09-14 Fault diagnosis

Country Status (2)

Country Link
GB (1) GB2542610B (en)
WO (1) WO2017050621A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2597920A (en) * 2020-07-30 2022-02-16 Spatialbuzz Ltd Fault monitoring in a communications network

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AUPN786896A0 (en) * 1996-02-02 1996-02-29 Telstra Corporation Limited A network fault system
US7752024B2 (en) * 2000-05-05 2010-07-06 Computer Associates Think, Inc. Systems and methods for constructing multi-layer topological models of computer networks
JP5217820B2 (en) * 2008-09-12 2013-06-19 富士通株式会社 Support program, support device, and support method
GB2542610B (en) * 2015-09-25 2019-07-03 British Telecomm Fault diagnosis

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002010944A1 (en) * 2000-08-01 2002-02-07 Qwest Communications International Inc. Performance modeling, fault management and repair in a xdsl network
US20130227103A1 (en) * 2012-02-23 2013-08-29 Infosys Limited End-to-end network service assurance solution

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2542610B (en) * 2015-09-25 2019-07-03 British Telecomm Fault diagnosis
WO2019012255A1 (en) * 2017-07-11 2019-01-17 Spatialbuzz Limited Fault monitoring in a utility supply network
US11026108B2 (en) 2017-07-11 2021-06-01 Spatialbuzz Limited Fault monitoring in a utility supply network

Also Published As

Publication number Publication date
GB2542610B (en) 2019-07-03
GB2542610A (en) 2017-03-29
GB201517018D0 (en) 2015-11-11

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16769922

Country of ref document: EP

Kind code of ref document: A1

DPE1 Request for preliminary examination filed after expiration of 19th month from priority date (pct application filed from 20040101)
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16769922

Country of ref document: EP

Kind code of ref document: A1