WO2017050621A1 - Fault diagnosis - Google Patents

Fault diagnosis

Info

Publication number
WO2017050621A1
WO2017050621A1 (PCT/EP2016/071722)
Authority
WO
WIPO (PCT)
Prior art keywords
fault
network
network condition
reports
database
Prior art date
Application number
PCT/EP2016/071722
Other languages
French (fr)
Inventor
Gerard POWIS
Original Assignee
British Telecommunications Public Limited Company
Priority date
Filing date
Publication date
Application filed by British Telecommunications Public Limited Company
Publication of WO2017050621A1

Classifications

    • H04L 41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L 41/06 Management of faults, events, alarms or notifications
    • H04L 41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H04L 41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H04L 41/0677 Localisation of faults
    • H04L 41/5074 Handling of user complaints or trouble tickets
    • H04L 43/12 Network monitoring probes
    • H04W 24/00 Supervisory, monitoring or testing arrangements

Abstract

Fault conditions in a distributed system (2) are diagnosed by correlation of routine network condition reports (50), collected by the network and stored in a database (55), with reports of faults (60) identified at remote termination points, typically by end-users, which are recorded in a fault logging system (6). Correlation of fault or condition reports, by reference to the topology (8) of the system, allows better identification (7) of the cause of a fault, and allows co-ordination of routine maintenance and inspection tasks (91) with ad hoc task management (90) to attend to faults recorded by the fault logging system.

Description

FAULT DIAGNOSIS
This invention relates to diagnosis of faults in networks and other distributed systems. It can often be very difficult to diagnose the root causes of reported faults in such a system as they may be manifested as several individual fault reports made at different locations which may be remote from each other and from the actual cause of the problem. Current fault identification methods rely heavily on previous fault history, but it takes time to identify, plan and carry out the work required.
For example, in a telecommunications network, the attempt to predict end-user faults using individual tests across tens of millions of lines, and to interpret those results, can trigger many false correlations. (For example, two fault reports from terminations connected to the same distribution point may be the result of a fault at the distribution point itself, but it may also be the result of two quite unrelated problems on the two individual lines).
Typical repair processes are classified as either reactive - an end-user enquiry prompts the network operator to run a line test, or to dispatch a technician to repair the fault - or proactive - an existing fault history (closed faults and results of routine line tests) is analysed and a technician sent to survey the network to confirm the need for preventative maintenance to prevent further problems or arrest a deteriorating trend, and then carry out the required preventative maintenance. Reactive processes are generally handled on an individual basis because they relate to failures that have already happened and service needs to be restored as soon as possible, whereas proactive processes can be scheduled in such a way as to make the most efficient use of technicians and resources, as they are dealing with gradual deterioration and potential future failures, which are not as time-critical.
Current fault volume reduction and performance enhancement processes rely heavily on human interpretation of the data.
The present invention provides apparatus for diagnosing fault conditions in a distributed system, having a network condition monitoring system for receiving measurement data from a plurality of reporting points in the distributed system, a network condition database, and a performance testing system for retrieving performance data from the network condition monitoring system to be stored in the network condition database, further comprising a fault logging system for reporting faults relating to a plurality of termination points connected to, but remote from, the reporting points, and a correlation system for comparing faults reported by the fault logging system with data stored in the network condition database to identify elements in the distributed system whose condition can be associated with one or more of the fault reports.
The invention also provides a method for diagnosing fault conditions in a distributed system, wherein network condition reports are collected in a network condition database by a network condition monitor taking measurements at a plurality of respective reporting points, and reports of faults relating to a plurality of termination points connected to, but remote from, the reporting points are recorded in a fault logging system, wherein faults recorded by the fault logging system are correlated with data stored in the network condition database to identify elements in the distributed system whose condition can be associated with the fault reports.
Thus a process is provided that links reported faults to root cause network issues. A link is formed between the network components shared by the terminations, thus facilitating the repair of end-user faults and the root cause in one visit. Where required it expedites fault volume reduction activity where specialist stores and expertise are required.
The invention therefore provides advantages by joining the activities and underlying data sources together. This gives rise to efficiencies across the programme and its technical resources.
The apparatus may further comprise a task management system for generating instructions to attend to the faults identified by the fault logging system, the task management system being further arranged so that performance-related tasks to be attended to together with the fault rectification tasks can be identified from the network condition database.
The relatedness of fault reports and network conditions, and the identification of related tasks, are identified by reference to a database of relationships between individual elements of the distributed system. The apparatus may further comprise a test probing system for periodically interrogating the reporting points for network condition reports.
In the case of proactive repair, a network can be defined as a series of end connection points e.g. PCP (Cabinet or Primary Cross-Connection Point) and DPs (Distribution Points), which can be used as demarcations to target investment and repair. Typically, the network topology is recorded as a series of cable sections, each running from one structure or connection point to another. The database is typically derived electronically as part of the planning process prior to installation or, for older installations, by digitisation of existing network drawings, and as well as the end points may record other data such as grade and composition of the cable, whether it is aerial or underground, etc. The network can be resolved down to groups of 100 lines (end-users). In reality each such group is itself an aggregation of numerous joint closures and cables (e.g. one representation could be five sets of 20-pair cables). This means a technician has to locate which part of the 5 x 20 = 100 lines to focus upon when identifying faults and conducting repairs. The present invention allows issues to be identified down to individual cable level. This makes for a more reliable diagnosis of issues and reduces the time taken to investigate.
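To make the cable-section model concrete, the following sketch shows one plausible way such topology records could be represented and queried; the class fields, the UPSTREAM map and the upstream_sections() helper are illustrative assumptions rather than the patent's actual schema.
```python
# Minimal sketch of a cable-section topology record and an upstream lookup,
# assuming a simple tree of connection points; all names and data are
# illustrative, not the patent's database schema.
from dataclasses import dataclass

@dataclass
class CableSection:
    section_id: str
    from_point: str      # e.g. a PCP (cabinet) or joint closure
    to_point: str        # e.g. a DP (distribution point) or a termination
    pairs: int           # cable size, e.g. a 20-pair cable
    aerial: bool         # aerial or underground
    grade: str           # cable grade / composition

# Upstream connectivity: each point maps to (parent point, section reaching it).
UPSTREAM = {
    "DP_201":    ("PCP_20", CableSection("S1", "PCP_20", "DP_201", 20, False, "0.5 mm Cu")),
    "DP_202":    ("PCP_20", CableSection("S2", "PCP_20", "DP_202", 20, True,  "0.5 mm Cu")),
    "term_2011": ("DP_201", CableSection("D1", "DP_201", "term_2011", 1, True, "drop wire")),
    "term_2013": ("DP_201", CableSection("D2", "DP_201", "term_2013", 1, True, "drop wire")),
    "term_2021": ("DP_202", CableSection("D3", "DP_202", "term_2021", 1, True, "drop wire")),
}

def upstream_sections(termination: str) -> list[CableSection]:
    """Resolve a termination to the ordered list of cable sections serving it."""
    path, point = [], termination
    while point in UPSTREAM:
        parent, section = UPSTREAM[point]
        path.append(section)
        point = parent
    return path

print([s.section_id for s in upstream_sections("term_2011")])  # ['D1', 'S1']
```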
These existing proactive repair systems rely on the manual analysis and interpretation of historic fault and test and diagnostic data to identify the parts of the network that are at the highest risk of failure. Once identified, work to conduct proactive uplift has to be planned, resourced and allocated. The present invention automates the review of these data sets but also adds real-time end-user fault information and field-based observations to provide a better means of prioritising the work required.
Conversely, existing reactive repair systems focus on the single end-user- reported fault and the supplying circuit. The present invention consolidates information relating to the surrounding network and this allows the technician to make a more informed judgement on what is the likely root cause to deliver a more holistic repair.
In short, the invention provides a symbiotic means of delivering responses to end-user fault reports and a proactive maintenance programme. This considered, the operator can wait until an end-user-reported fault occurs in the same location as a suspect network component. This serves to increase the likelihood of predicting the point of failure with the maximum return on investment when resolved. This in turn allows the identification of the early signs of potential failure in the network but with enough confidence to allow time to carry out a lasting repair.
The technician is provided with multiple test results from the proactive testing regime, previous fault history information, and recorded defects, displayed in a graphical user interface with a map. This provides the technician with a much richer picture of network health to support the identification of a root cause and permanent solution for the end-user.
An embodiment of the invention will now be described by way of example, with reference to the Figures, in which:
Figure 1 is a schematic representation of a simplified communications network incorporating a fault diagnosis system operating according to an embodiment of the invention
Figures 2 and 3 illustrate an example of a first situation in which fault diagnosis can be facilitated by use of this embodiment of the invention
Figures 4 and 5 illustrate an example of a second situation in which fault diagnosis can be facilitated by use of this embodiment of the invention
Figure 6 is a schematic representation of a fault identification process operating according to the invention
Figure 7 is a schematic overview of a task management process making use of the fault identification process of Figure 6
In this embodiment the necessary operations and systems are embodied in software running on one or more server platforms. The embodiment brings together reactive and proactive activities and systems to deliver a more efficient maintenance process.
Essentially it involves "drilling down" fault reports to correlate them to the cable sections serving the termination reporting the fault, looking for correlations between fault reports which relate to terminations that share a section "upstream", and correlating such user-end reports with routine network test data, which are generally measured from the network end, to identify elements of the network that are deteriorating, again using the network topology as the basis for correlation.
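As a rough illustration of this correlation step, the sketch below groups live fault reports by the upstream elements their terminations share; the termination-to-path lookup is assumed data, and in practice it would come from the network inventory described earlier.
```python
# Sketch: count how often each upstream element appears in the paths of the
# terminations that have reported faults. Shared elements with high counts are
# candidate common causes. The path data here is an illustrative assumption.
from collections import Counter

UPSTREAM_PATH = {            # termination -> elements between it and the exchange
    "2011": ["DP_201", "PCP_20"],
    "2013": ["DP_201", "PCP_20"],
    "2021": ["DP_202", "PCP_20"],
    "2022": ["DP_202", "PCP_20"],
}

def shared_upstream(fault_reports: list[str]) -> Counter:
    counts: Counter = Counter()
    for termination in fault_reports:
        counts.update(UPSTREAM_PATH.get(termination, []))
    return counts

# Reports from 2011, 2013 and 2021: PCP_20 serves all three, DP_201 serves two.
print(shared_upstream(["2011", "2013", "2021"]).most_common())
```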
Live fault reports are matched to a prioritised list of existing network problems (faults, defects and suspect lines from overnight routine testing), identifying lines that are suspect using an algorithm that uses conditions and a scoring method to grade electrical health. The technician then receives, in addition to the actual end-user fault report, either a fault volume reduction (FVR) task or, if the problem is not yet identified, a survey task, to be carried out alongside the rectification of the prioritised end-user-reported fault. This allows suspect lines to be prioritised according to their effect on the service that end-users are experiencing.
As depicted in Figure 1, a network, generally indicated at 1, comprises a number of primary nodes 2, 3, 4, each connected, directly or indirectly, to a number of respective secondary nodes 20, 201, 202, 21, 211, 23, 24; 30, 31; 40, 401, 41, 411, and thus to a number of user terminations 2011, 2012, 2013, 2021, 2022, 2101, 2111, 2112, 3001, 3002, 3003, 3101, 3102, 4011, 4012, 4013, 4101, 4102, 4111, 4112.
A network condition monitoring system 5 is arranged to receive fault reports logged by the primary nodes 2, 3, 4, either in response to condition report signals triggered by predetermined fault conditions, or in response to probe signals transmitted from the network condition monitoring system, and store the results of such tests in a database 55.
In addition, a user fault report logging system (UFRL) 6 records faults associated with individual network terminations e.g. 2011, 2012, ... 4112. Such reports may be logged automatically, by fault logging applications in the individual terminations e.g. 3001, by transmitting reports through the network 1 and a connection 600. However, many faults are of a type which are either not detectable by the equipment itself, or result in disconnection from the network 1, and so in most cases fault reports are generated manually through a user interface 60 in response to a user identifying a fault and reporting it to a human-operated helpdesk or by logging the fault using another communications system, for example reporting a fault on a landline by using a wireless internet connection to connect to the fault logging database 6.
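A minimal fault-report record for such a logging system might look like the sketch below; the field names and the "automatic"/"manual" source values are assumptions used only to illustrate the two reporting routes just described.
```python
# Illustrative fault-report record for the user fault report logging (UFRL)
# system; field names and source values are assumptions, not the patent's
# data model.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class UserFaultReport:
    termination: str       # e.g. "3001"
    description: str       # e.g. "no dial tone"
    source: str            # "automatic" (terminal self-report) or "manual" (helpdesk / web)
    reported_at: datetime

reports = [
    UserFaultReport("3001", "no dial tone", "automatic", datetime.now(timezone.utc)),
    UserFaultReport("2011", "noisy line", "manual", datetime.now(timezone.utc)),
]
print(sum(r.source == "manual" for r in reports), "manual report(s) logged")
```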
In prior art systems, problems reported at the network end 5 are handled in a different manner from faults reported at the user end 6. In general, network conditions are monitored routinely and trends are determined, to build up a database 55 of the general condition of the network, so that intervention can be made pro-actively before a deteriorating condition becomes bad enough to impair service to customers. Some faults may be noticeable to customers as a gradual deterioration in service quality, but others may not manifest themselves to the customer at all, provided a backup system is available - unless, and until, the backup system fails too. Having a network-wide view of conditions allows for efficient scheduling of maintenance tasks, by scheduling tasks according to priorities such as the rate of deterioration, technician availability, and manpower deployment, for example scheduling a low-priority task to be done at the same time as a higher-priority task at the same location.
On the other hand, all user-reported faults are treated as high priority because they have, by definition, already had an effect on the end-user's ability to use the service. Such a fault reporting and management system is necessarily reactive. It will be noted in particular that diagnosis is made more complicated because only positive reports of faults are received from the user end - absence of a fault report from a given user, e.g. 4101, does not necessarily indicate that there is no fault on the node 41 connecting that user to the network. The user may not have tried to use the connection recently, or not to the extent necessary for the fault to be apparent. In particular, the fault may be intermittent, or only evident when several users are trying to connect through the same intermediate nodes. Thus, referring to Figure 1, the presence of fault reports from terminals 4102, 4111 and 4112 may indicate a fault at one of the nodes 4, 41 serving all three terminals, or it may indicate two separate faults: one at the node 411 serving both nodes 4111 and 4112, and a second at node 4102 itself. Nothing can be deduced from the absence of a fault report from the node 4101.
Figures 2, 3, 4 and 5 illustrate how the network condition database 55 and user fault logging system 6 both have incomplete views of the condition of the network and in particular the location of a fault. These Figures all depict a part of the network 1, namely that part connected to one of the secondary nodes 20. The network condition monitoring system 5 monitors this part of the network through the respective primary node 2. Any faults reported by the users are logged in the user fault reporting system 6. In each of these examples, a fault is present on one or more of the secondary nodes 20, 201, 202, but neither the network-end database 55 nor the user-facing end 6 has a sufficiently complete picture to determine exactly which node or nodes is or are the problem. In these figures, solid triangles represent end-user terminals in respect of which faults have been reported to the UFRL 6, and solid circles represent the node or nodes that require attention.
As seen in Figure 2, end-users 2011, 2013 and 2021 all report a connection fault to the UFRL 6. As their first common point of connection to the network is the node 20, this pattern of faults suggests a fault at that node, and a reactive response system would initiate a site visit to investigate. Note that, as the fault reporting system at the customer end is reactive, the system can only deduce the presence of faults from positive reports - the absence of fault reports from other end-users 2012, 2022, connected through the node 20 is not evidence that the faults are more localised: the respective users may not have observed the fault, for example because they are not using a service on which the fault is evident.
However, as seen in Figure 3, the same pattern of user fault reports (2011, 2013, 2021) may be caused by two separate faults deeper into the network (closer to the end users), in this example at nodes 201 and 202. In this case, a site visit to the common point of connection at 20 would be a waste of time and resources, as the faults are elsewhere.
Such ambiguities can be resolved, in this embodiment, by matching the fault reports logged at the UFRL 6 in respect of the user terminations 2011, 2013, 2021 with the view of the network as seen from the network condition database 55. In the situation depicted in Figure 2, the network condition database 55 has, as a result of routine probes of the network, discovered suboptimal performance at the node 20, which confirms that the faults reported at 2011, 2013, 2021 have a common cause. Conversely, in the situation depicted in Figure 3, no fault is identified at the common node 20, indicating that the faults reported to the UFRL 6 do not have a common cause. As potential faults have been logged at both nodes 201 and 202, it is those which should be attended to.
Comparing now Figures 4 and 5, the network condition database 55 has transmitted a probe message down the connection 222 to the node 20 serving the part of the network under discussion, from which a negative or errored response 50 (Figures 1, 6) has been received, but the nature of the fault is such that its precise location cannot be determined from the network end. By matching the error to customer reports it can be possible to identify the affected node more precisely. In Figure 5, fault reports are only arriving at the UFRL 6 from user terminals connected to one of the secondary nodes - node 201 - indicating that the fault is likely to be in that node, whereas in Figure 4 fault reports are arriving from terminations connected to more than one of the secondary nodes, making it more likely that the fault is in the primary node 20.
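The reasoning behind Figures 2 to 5 can be summarised as a small decision rule: an element common to all reporting terminations is only treated as the probable root cause if the network-end condition data also shows it as degraded; otherwise attention falls to degraded elements on the individual branches. The sketch below illustrates that rule with assumed data and function names.
```python
# Sketch of the Figure 2-5 disambiguation: a shared upstream element is only
# treated as the probable root cause if (a) every reporting termination uses it
# and (b) the network-end condition data flags it as degraded.
# All names and data are illustrative assumptions, not the patent's interfaces.

UPSTREAM_PATH = {
    "2011": ["DP_201", "PCP_20"],
    "2013": ["DP_201", "PCP_20"],
    "2021": ["DP_202", "PCP_20"],
}

def probable_root_causes(reports: list[str], degraded: set[str]) -> list[str]:
    """Return degraded elements shared by every reporting termination, falling
    back to degraded elements on individual branches if no common cause fits."""
    paths = [set(UPSTREAM_PATH[t]) for t in reports]
    common = set.intersection(*paths) if paths else set()
    confirmed_common = sorted(common & degraded)
    if confirmed_common:
        return confirmed_common                           # Figure 2 / Figure 4 case
    # Figure 3 / Figure 5 case: degraded elements on individual branches.
    return sorted(set().union(*paths) & degraded)

print(probable_root_causes(["2011", "2013", "2021"], degraded={"PCP_20"}))              # ['PCP_20']
print(probable_root_causes(["2011", "2013", "2021"], degraded={"DP_201", "DP_202"}))    # ['DP_201', 'DP_202']
```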
It will be appreciated that the examples given here are very much simpler than real-life situations, which may involve much more complex network topologies such as rings, duplicate paths, etc.
As depicted in Figure 1 , this embodiment operates using a pattern matching system 7 to compare fault reports logged by the user fault report logging system 6 with the current condition status of the various network elements stored in the network condition database 55, by reference to a network model 8. The pattern matching system provides an output 90 to a task management system 9 for allocating tasks to maintenance and repair personnel.
Figure 6 is a schematic diagram illustrating the various information flows taking place in the embodiment. Figure 7 is a schematic diagram illustrating the task allocation process using the fault identification process of Figure 6. The process consists of the following steps:
Routine tests 50 carried out on the network 2 are used by the network condition monitor 5 to identify suspect lines, which are reported (step 51) to the network condition database 55. These suspect lines are identified (step 5) using an algorithm that uses conditions and a scoring method to grade each line's electrical health. When a live fault report 60 is delivered by the user fault report logger system 6, it is matched (77) to a prioritised list of existing network problems (faults, defects and suspect lines from overnight routine testing) stored in the network condition database 55. By combining data from overnight tests, fault volume reduction and network inventory systems it is possible to identify the cable sections deemed at a high risk of failure. The system then matches the cable sections to the first in-bound fault along the same route.
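The scoring algorithm itself is not specified in detail, so the following is only a plausible sketch of how line conditions from overnight testing might be weighted into an electrical-health score and a suspect-line list; the condition names, weights and threshold are assumptions.
```python
# Hypothetical health-scoring sketch: weight observed test conditions and flag
# suspect lines. The condition names, weights and threshold are assumptions;
# the patent only states that conditions and a scoring method are used.
CONDITION_WEIGHTS = {
    "high_resistance_joint": 40,
    "low_insulation_resistance": 30,
    "battery_contact": 25,
    "capacitive_imbalance": 15,
}
SUSPECT_THRESHOLD = 50

def health_score(conditions: list[str]) -> int:
    """Sum the penalty weights of the conditions observed on a line."""
    return sum(CONDITION_WEIGHTS.get(c, 0) for c in conditions)

overnight_results = {
    "line_2011": ["low_insulation_resistance", "capacitive_imbalance"],
    "line_2012": [],
    "line_2013": ["high_resistance_joint", "battery_contact"],
}

suspect_lines = {line: health_score(conds)
                 for line, conds in overnight_results.items()
                 if health_score(conds) >= SUSPECT_THRESHOLD}
print(suspect_lines)  # e.g. {'line_2013': 65}
```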
The potentially relevant parts of the network 2 are identified from the network model 8 (step 80) and the fault report 60 is compared with data 52 from the network condition database 55 relating to those parts of the network 2 to identify the most likely location of the cause of the faults reported by the users.
These faults are then prioritised (71), for example by severity, number of end-users affected, availability of backup capability, or safety-critical functions, and a list of fault rectification tasks is generated (72) which are reported (70) to a task management system 9. The task management system 9 co-ordinates data relating to equipment which requires investigation (92), reported faults (93) and defects identified from network tests (94) to generate instructions 90 to rectify the faults, along with any related network management tasks 91 which may conveniently be done at the same time. Such management tasks 91 would typically be investigation of impaired performance which has not yet caused any customer fault reports, routine maintenance, etc. These instructions are then transmitted to the technical staff through an interface 95. The technicians, having carried out the repair work specified in the fault rectification instruction 90, can perform any related survey tasks 91 and, if necessary, report on the network condition at that location and identify any further work that may be necessary (96), which is used to update the network condition database 55.
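One way the prioritisation and bundling described above could be realised is sketched below: each rectification task is scored by its impact, and any proactive work recorded against the same location is attached so it can be carried out on the same visit. The weights and record fields are assumptions for illustration only.
```python
# Sketch: prioritise fault rectification tasks and bundle co-located proactive
# work (surveys, fault volume reduction) so it can be done on the same visit.
# Priority weights and record fields are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class FaultTask:
    task_id: str
    location: str                     # e.g. a PCP or DP identifier
    users_affected: int
    safety_critical: bool = False
    backup_available: bool = True
    related_tasks: list[str] = field(default_factory=list)

    def priority(self) -> int:
        score = self.users_affected
        if self.safety_critical:
            score += 100              # safety-critical functions come first
        if not self.backup_available:
            score += 20               # lack of backup capability raises urgency
        return score

# Proactive work already identified against network locations.
proactive_work = {"PCP_20": ["survey joint closure", "replace degraded 20-pair cable"]}

tasks = [FaultTask("T1", "PCP_20", users_affected=3),
         FaultTask("T2", "DP_411", users_affected=1, backup_available=False)]

for task in sorted(tasks, key=lambda t: t.priority(), reverse=True):
    task.related_tasks = proactive_work.get(task.location, [])
    print(task.task_id, task.priority(), task.related_tasks)
```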
This embodiment can therefore be used to allow technicians to identify and undertake fault volume reduction work whilst attending a live fault.
A technician then receives either a fault volume reduction (FVR) task (90) or a survey task (91). The technician will fix the end-user fault as a priority and then carry out the FVR or survey task. The integration of user fault report logging and network condition data allows information to be presented to the technician so that the technician can repair both the immediate fault and the root cause of failure in the network in a single visit. The technician can therefore be presented with all of the test, network inventory and defect information to fix the fault (reactive part). The technician can survey and address the underlying root-cause and, where appropriate, proactively carry out any required fault volume reduction work.
The embodiment has been described in relation to a telecommunications network, but the principle may be applied in other utility-based industries where inventory data and historical data can be used, for example leak detection and repair in oil, gas and water applications, or in a highway maintenance regime for potholes, by comparing real-time monitoring feeds with in-bound defect reports and then conducting resurfacing activities.

Claims

1. Apparatus for diagnosing fault conditions in a distributed system having a network condition monitoring system for receiving measurement data from a plurality of reporting points in the distributed system, a network condition database, and a performance testing system for retrieving performance data from the network condition monitor to be stored in the network condition database, further comprising a fault logging system for reporting faults relating to a plurality of termination points connected to, but remote from, the reporting points, and a correlation system for comparing faults reported by the fault logging system with data stored in the network condition database to identify elements in the distributed system whose condition can be associated with one or more of the fault reports.
2. Apparatus according to Claim 1, further comprising a task management system for generating instructions to attend to a reported fault identified by the fault logging system, the task management system being further arranged to identify from the network condition database any related performance-related tasks to be attended to together with the reported fault.
3. Apparatus according to Claim 1 or Claim 2, further comprising a database of relationships between individual elements of the distributed system, wherein the correlation system identifies the relatedness of fault reports and network conditions, and the identification of related tasks, by reference to the database.
4. Apparatus according to Claim 3, wherein the database stores a record of the topology of the distributed system, and of connections between the individual elements.
5. Apparatus according to Claim 1, Claim 2, Claim 3 or Claim 4, further comprising a test probing system for periodically interrogating the reporting points for network condition reports.
6. A method for diagnosing fault conditions in a distributed system, wherein network condition reports are collected in a network condition database by a network condition monitor taking measurements at a plurality of respective reporting points, and reports of faults relating to a plurality of termination points connected to, but remote from, the reporting points are recorded in a fault logging system, wherein faults recorded by the fault logging system are correlated with data stored in the network condition database to identify elements in the distributed system whose condition can be associated with the fault reports.
7. A method according to Claim 6, wherein instructions to attend to the faults identified by the fault logging system are generated, and wherein performance-related tasks to be attended to together with rectification of the reported faults are identified from the network condition database.
8. A method according to Claim 6 or Claim 7, wherein the relatedness of fault reports and network conditions, and the identification of related tasks, are identified by reference to a database of relationships between individual elements of the distributed system.
9. A method according to Claim 8, wherein the database stores a record of the topology of the distributed system, and of connections between the individual elements.
10. A method according to Claim 6, Claim 7, Claim 8 or Claim 9, wherein the reporting points are periodically interrogated for network condition reports.
PCT/EP2016/071722 2015-09-25 2016-09-14 Fault diagnosis WO2017050621A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB1517018.6 2015-09-25
GB1517018.6A GB2542610B (en) 2015-09-25 2015-09-25 Fault diagnosis

Publications (1)

Publication Number Publication Date
WO2017050621A1 true WO2017050621A1 (en) 2017-03-30

Family

ID=54544133

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2016/071722 WO2017050621A1 (en) 2015-09-25 2016-09-14 Fault diagnosis

Country Status (2)

Country Link
GB (1) GB2542610B (en)
WO (1) WO2017050621A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2597920A (en) * 2020-07-30 2022-02-16 Spatialbuzz Ltd Fault monitoring in a communications network

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AUPN786896A0 (en) * 1996-02-02 1996-02-29 Telstra Corporation Limited A network fault system
US7752024B2 (en) * 2000-05-05 2010-07-06 Computer Associates Think, Inc. Systems and methods for constructing multi-layer topological models of computer networks
JP5217820B2 (en) * 2008-09-12 2013-06-19 富士通株式会社 Support program, support device, and support method
GB2542610B (en) * 2015-09-25 2019-07-03 British Telecomm Fault diagnosis

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002010944A1 (en) * 2000-08-01 2002-02-07 Qwest Communications International Inc. Performance modeling, fault management and repair in a xdsl network
US20130227103A1 (en) * 2012-02-23 2013-08-29 Infosys Limited End-to-end network service assurance solution

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2542610B (en) * 2015-09-25 2019-07-03 British Telecomm Fault diagnosis
WO2019012255A1 (en) * 2017-07-11 2019-01-17 Spatialbuzz Limited Fault monitoring in a utility supply network
US11026108B2 (en) 2017-07-11 2021-06-01 Spatialbuzz Limited Fault monitoring in a utility supply network

Also Published As

Publication number Publication date
GB2542610B (en) 2019-07-03
GB2542610A (en) 2017-03-29
GB201517018D0 (en) 2015-11-11

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16769922

Country of ref document: EP

Kind code of ref document: A1

DPE1 Request for preliminary examination filed after expiration of 19th month from priority date (pct application filed from 20040101)
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16769922

Country of ref document: EP

Kind code of ref document: A1