WO2017059904A1 - Anomaly detection in a data packet access network - Google Patents

Anomaly detection in a data packet access network

Info

Publication number
WO2017059904A1
Authority
WO
WIPO (PCT)
Prior art keywords
fault
network element
ports
status
determined
Prior art date
Application number
PCT/EP2015/073186
Other languages
French (fr)
Inventor
Octavian MANESCU
Voicu ALBU
Original Assignee
Telefonaktiebolaget Lm Ericsson (Publ)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Telefonaktiebolaget Lm Ericsson (Publ)
Priority to PCT/EP2015/073186
Publication of WO2017059904A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0604 Management of faults, events, alarms or notifications using filtering, e.g. reduction of information by using priority, element types, position or time
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 Network analysis or design
    • H04L41/142 Network analysis or design using statistical or mathematical methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0681 Configuration of triggering conditions
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/06 Generation of reports
    • H04L43/067 Generation of reports using time frame reporting
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/0805 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability
    • H04L43/0817 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability by checking functioning

Definitions

  • the present invention relates to a method for detecting a fault in a network element located in a data packet access network and relates to the corresponding fault detector. Furthermore, a computer program comprising program code and a carrier comprising the computer program are provided.
  • RTUs Remote Test Units
  • Alarms are generated by processing the results of the measurements. If a fault at the network element in the access network should be detected without RTUs, it is known to use artificial intelligence algorithms in order to obtain an increased diagnostic accuracy. Faults in a network element of the access network may be detected without hardware probes using xDSL access network characteristics.
  • WO 2013/144539 A discloses a solution which reads the xDSL status ports from an Element Manager System, EMS. Furthermore, it is possible to detect a fault using xDSL parameters such as the number of erroneous seconds, wherein the type of fault is identified using statistical algorithms such as regression.
  • the alarm is raised based on a parameter associated with a port collected by an Element Manager System (EMS) or by a RADIUS protocol which in the case of large access networks can lead to the overload of EMS or RADIUS.
  • EMS Element Manager System
  • RADIUS Remote Authentication Dial-In User Service
  • a method for detecting a fault in a network element is provided, the network element being located in a data access network.
  • status parameters of a plurality of ports of the network element are determined and an evolution in time of the status parameters is determined.
  • a status indicator OSR is determined and a temporal evolution of the status indicator is determined based on the determined status parameters, wherein the status indicator represents information of aggregated status parameters of several ports of the network element.
  • a self-adaptive statistical model is determined based on the determined status indicator OSR, an outlier in the self-adaptive statistical model is determined, and a fault in the network element is detected based on the determined outlier.
  • the proposed solution is based on existing resources in the data packet access network and there is no need for additional equipment such as measuring probes or RTUs.
  • the self-adaptive statistical model which self-adapts based on storage and status parameter values furthermore provides a better detection of a fault, as not a predefined and constant model is used.
  • the status parameters are read directly from the network element and not via EMS, so no overload of EMS or of the RADIUS protocol occurs.
  • the corresponding fault detector comprising at least one processing unit configured to inter alia carry out the steps mentioned above.
  • a computer program is provided comprising program code to be executed by at least one processing unit, wherein execution of the program code causes the at least one processing unit to execute the method mentioned above.
  • a carrier comprising the computer program is provided.
  • Fig. 1 is a flowchart of a method carried out by a fault detector in order to detect a fault at a network element located in a data packet access network.
  • Fig. 2 is a flowchart of a method carried out by a fault detector for detecting a fault in a network element according to a further embodiment.
  • Fig. 3 illustrates an architecture of a system in which a fault detector detects a fault of a network element located in an IP access network.
  • Fig. 4 indicates a time series of a status indicator determined based on status parameters of the ports of the network element located in the access network.
  • Fig. 5 shows a time series of a z score determined from the data shown in Fig. 4.
  • Fig. 6 shows a time series of a model parameter ΔZscore of a self-adaptive statistical model used to determine a fault of the network element.
  • Fig. 7 shows a time series of a statistical model used to detect the fault, the statistical model providing an upper threshold and a lower threshold used to determine if a fault at the network element is detected and an alarm is generated.
  • Fig. 8 shows a schematic architectural view of a fault detector used to detect fault at the network element.
  • Fig. 9 shows a schematic view of how the ports of the network element are grouped into different groups, each port having a certain port status.
  • Fig. 10 shows a schematic architectural overview of a system including the network element such as a DSLAM and a fault detector.

Detailed Description

  • In the following, a technique for detecting a fault in a network element is described in further detail, especially a technique which is based on existing resources, such as status parameters of ports of the network element.
  • the fault is detected in a network element such as a DSLAM (Digital Subscriber Line Access Multiplexer) which connects multiple Digital Subscriber Line (DSL) interfaces to a digital communication channel using multiplexing.
  • DSLAM Digital Subscriber Line Access Multiplexer
  • DSL Digital Subscriber Line
  • the proposed solution is based on existing resources in the access network and does not require investment in additional equipment, such as measuring probes.
  • the present invention uses the status of the ports of the network element, wherein the ports are used to connect the different DSL interfaces to the communication channel as mentioned above.
  • the port status is used as a source of data for detecting anomalies at the network element/DSLAM.
  • the method starts from a characteristic of the access network, such as the xDSL access network, i.e. a port status, without using hardware test probes.
  • the port status is read directly from the network element, without using intermediate elements from which the port status could be obtained, such as management elements (EMS, Element Management System) or an authentication platform using the RADIUS protocol.
  • EMS Element Management System
  • RADIUS Remote Authentication Dial-In User Service
  • This approach avoids and eliminates the overload problem of EMS and RADIUS and also identifies automatically from the beginning the status of the network elements.
  • a generation and clearing of alarms as a result of a detected fault is done automatically using a self-adaptive statistical model based on a pattern of a behavior over time of the status of the ports.
  • the model is adaptive with a dynamic alarm threshold and varies over time and is specific to each network element and differs from one network element to another network element.
  • the method can furthermore take into account information from external resources, such as information about the network inventory in which the network element is located or information about known network interruptions due to maintenance work etc.
  • the different ports and port status parameters may be grouped into different groups, as different ports of the network element are connected to cables, such as copper cables, wherein the copper cables are divided into groups and each group has several pairs of wire, wherein a pair of wire can be connected to one port, so that the port is thus connected or associated with a group.
  • the faults at the network element can then be categorized taking into account a status parameter of all ports belonging to the same group.
  • Fig. 10 shows an architectural overview of a telecommunication system in which a network element such as a DSLAM 210 is provided to collect the data from different CPEs (customer premises equipment) 310. The aggregated traffic is then directed via a router or switch 500 to a backbone switch of the telecommunications system and to the internet. Depending on the role of the different CPEs 310 the data handled by DSLAM 210 can include voice data and other data packets such as internet data. As explained in further detail in connection with Fig. 9 each DSLAM comprises a plurality of ports 215 used for transmission, reception and forwarding of data packets.
  • a fault detector 100 is provided which will detect faults and errors of the ports 215 and can generate alarms for the network element 210.
  • the fault detector 100 may be part of the DSLAM or a separate unit connected to the DSLAM, and may specifically be a fault detector 100 as described below in connection with Fig. 8.
  • Fig. 3 provides a more detailed overview in which the fault detector 100 can detect faults and errors of the network element 210 provided in data packet access network 200.
  • the access network is connected to data sources 300, wherein the data source can comprise customer premises equipment (CPE) 310 of Fig. 10 and DSL terminations.
  • CPE customer premises equipment
  • the data received from the network sources or requested from the network sources can be transmitted over the access network 200 to a wide area network such as the Internet not shown in Fig. 3.
  • the fault detector may use information from external services 400. Referring also to fig. 2, a high level summary of the steps carried out by the fault detector 100 is given.
  • In a first step S21, the status parameters of the ports are determined.
  • the link status of a port may be down, meaning that the link is down, or may be up, meaning that the link is enabled and ready to send data packets.
  • other status parameters such as dormant or not connected, are known.
  • the status parameters of all administered ports i.e. the ports which are configured for data service delivery, are taken into account and a port status up or port status down will be used to determine a fault at the network element.
  • the status parameter and its behavior over time will form the basis for a self-adaptive statistical model that is generated in step S22 using the time evolution of the status parameters of the ports.
  • the model is specific to each network element and can be used as a system that accepts as input source an aggregated indicator, a status indicator of the status parameters of the ports for the monitored DSLAM. If certain conditions based on the statistical model are met, an alarm is triggered or an alarm is cleared in step S23.
  • the alarms are enriched with information from the external data 400 shown in fig. 3, such as the network inventory, and then the alarm is filtered in order to filter out alarms for which the reason is already known, such as an interruption in the access network for administration or maintenance work.
  • the CPE to which the data packets are transmitted or from where the data packets are received may have been turned down due to the shutdown of the electrical network, e.g. due to work scheduled in the electrical network by the network supplier or by faults in the electrical network.
  • the client operating the CPE may have simply turned off the power of the CPE.
  • a change of a status parameter may also originate from faults of some equipment cards at the DSLAM site, be it a hardware or software fault.
  • the copper cables connected to the ports may be cut or may even be stolen.
  • a further reason for a change of the operating parameters may be work scheduled for replacement or the repair of the copper cables.
  • operating parameters of the copper cable may degrade resulting in an increased instability of the customer lines.
  • a further reason is the suspension of the service for the client due to non-payment of services.
  • step S25 may be carried out in which a damaged, a cut or a stolen cable is located using test measurements such as SELT (Single Ended Loop Testing) measurements which may be performed by a chipset on a DSLAM card. If the result of the measurement is that all cables for which an operating parameter down has been detected have the same length, the cable may be cut, stolen or damaged. If different length measurements result from the SELT measurement, it can be excluded that the cable is the reason for ports of the same group having the status parameter down.
  • SELT Single Ended Loop Testing
  • In step S26, an analysis of the status parameters is carried out in which the access network instability and the location of the points of instability are analyzed.
  • Using information from a storage module 131 of Fig. 3, reports are generated, such as: how often a port status changes in a given period of time; which cards / DSLAMs have the largest number of ports that change state in a certain period of time.
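  • The two reports named above can be illustrated with a minimal Python sketch. It assumes the stored history is available as a mapping from a port identifier to its time-ordered status samples; the identifiers and the helper names `count_status_changes` and `most_unstable` are hypothetical and not taken from the source.

```python
from typing import Dict, List


def count_status_changes(history: Dict[str, List[str]]) -> Dict[str, int]:
    """Count, per port, how often two consecutive status samples differ."""
    return {
        port: sum(1 for prev, cur in zip(samples, samples[1:]) if prev != cur)
        for port, samples in history.items()
    }


def most_unstable(history: Dict[str, List[str]], top: int = 3) -> List[str]:
    """Return the ports with the most status changes, most unstable first."""
    counts = count_status_changes(history)
    return sorted(counts, key=lambda port: counts[port], reverse=True)[:top]


# Hypothetical stored history: one status sample per polling interval.
history = {
    "card1/port1": ["up", "down", "up", "down"],
    "card1/port2": ["up", "up", "up", "up"],
    "card2/port1": ["up", "down", "down", "down"],
}
changes = count_status_changes(history)
```

  • Aggregating the same counts per card or per DSLAM would give the second report; the grouping key is a design choice left open here.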
  • Fig. 3 provides a more detailed description of the method carried out by fault detector 100 in order to detect fault at network element 210.
  • a task scheduler 121 can launch a data gathering module 122 periodically, at regular intervals such as 15 minutes or any other interval.
  • In step S2, the data gathering module reads the status parameters of the ports 215 shown in fig. 9.
  • the status parameters may be read using multi-threaded SNMP (Simple Network Management Protocol) queries launched through the IP network to the access network.
  • SNMP Simple Network Management Protocol
  • the result of the reading of the status parameters of the DSLAM is then returned through SNMP to the data gathering module 122 where a status indicator OSR is calculated.
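  • The multi-threaded reading of port statuses can be sketched as follows. This is only an illustration: the actual SNMP query is not shown, and the per-port reader is passed in as a callable so that no particular SNMP library is implied. The names `gather_port_statuses` and `fake_reader` and the host name are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple


def gather_port_statuses(
    elements: Dict[str, List[int]],
    read_status: Callable[[str, int], str],
    max_workers: int = 8,
) -> Dict[str, Dict[int, str]]:
    """Read the status of every port of every network element in parallel.

    `read_status(host, port_index)` stands in for one SNMP query and is
    expected to return a status string such as "up" or "down".
    """
    jobs: List[Tuple[str, int]] = [
        (host, index) for host, ports in elements.items() for index in ports
    ]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of `jobs`.
        results = list(pool.map(lambda job: read_status(*job), jobs))
    statuses: Dict[str, Dict[int, str]] = {host: {} for host in elements}
    for (host, index), status in zip(jobs, results):
        statuses[host][index] = status
    return statuses


def fake_reader(host: str, port_index: int) -> str:
    """Hypothetical stand-in for an SNMP read: even ports are down."""
    return "down" if port_index % 2 == 0 else "up"


snapshot = gather_port_statuses({"dslam-a": [1, 2, 3]}, fake_reader)
```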
  • the status indicator OSR is an Out of Sync Ratio determined for the administered ports of the network element.
  • OSR is the ratio between the number of ports that have the operational status parameter down/out of sync and the number of administered ports, i.e. the ports configured for delivery of data packets.
  • OSR has values in the range between 0 and 1, wherein the value is 0 if all ports are synchronized or have the status parameter up and the value is 1 if all ports have the status parameter down or are desynchronized.
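  • As a small illustration of the Out of Sync Ratio, the following Python sketch computes the ratio for one network element. It assumes the statuses of the administered ports are given as the strings "up" and "down"; the function name `out_of_sync_ratio` is hypothetical.

```python
from typing import Iterable


def out_of_sync_ratio(port_statuses: Iterable[str]) -> float:
    """Return (number of ports down) / (number of administered ports)."""
    statuses = list(port_statuses)
    if not statuses:
        raise ValueError("at least one administered port is required")
    down = sum(1 for status in statuses if status == "down")
    return down / len(statuses)


# One of four administered ports is out of sync.
osr = out_of_sync_ratio(["up", "down", "up", "up"])
```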
  • In step S3, this status indicator OSR and its evolution over time are forwarded to statistical module 123 where a self-adaptive statistical model is generated.
  • The self-adaptive statistical model describes a functional behavior pattern for the network element. This pattern is unique to each network element and changes over time.
  • In step S4, the parameters of the statistical model are sent to an alarm module 124 which is a module configured to trigger and clear alarms.
  • the statistical model has a time-dependent upper threshold and a time-dependent lower threshold. These thresholds are determined for different time intervals, and a further parameter derived from the status indicator OSR is checked against them to determine whether it lies above the upper threshold or below the lower threshold, i.e. outside the thresholds.
  • In step S5, the generated alarms are sent to the enrichment filtering module 125.
  • the generated alarms are enriched with information from external sources 400, such as the network inventory database 410 or the power alarm flows 420 or a database 430 informing about scheduled work performed in the network for different reasons, such as maintenance or repair.
  • Module 125 classifies the alarms and checks whether the reason for the alarm can be explained with the information provided from the external resources 400.
  • alarms generated by ports for reasons of scheduled work or lack of electric power in a specific area are classified as informative, but the causes of those alarms are treated in other monitoring systems and are not part of the present invention.
  • the fault detector is interested in alarms that remain after filtering out the alarms mentioned above for which the reason can be explained with other reasons and for which the reason is known. Furthermore, a categorization of the alarms can be carried out in the enrichment filtering module 125. As will be explained in connection with fig. 9 further below, depending on the number of ports having an operational status down compared to the number of ports of a group or all ports of a network element, the level of the alarm may be classified to a low level, a medium level or a high level alarm.
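  • The filtering described above can be sketched in Python as follows. The data layout is an assumption: alarms carry a network element identifier and a geographic area, and the external sources are reduced to two sets. The names `Alarm`, `classify_alarm` and the label "unexplained" are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Alarm:
    element_id: str  # hypothetical identifier of the network element
    area: str        # hypothetical geographic area from the inventory


def classify_alarm(
    alarm: Alarm, scheduled_work: Set[str], power_outage_areas: Set[str]
) -> str:
    """Label an alarm "informative" if an external source explains it."""
    if alarm.element_id in scheduled_work:
        return "informative"
    if alarm.area in power_outage_areas:
        return "informative"
    return "unexplained"


def remaining_alarms(
    alarms: List[Alarm], scheduled_work: Set[str], power_outage_areas: Set[str]
) -> List[Alarm]:
    """Keep only the alarms for which no known reason exists."""
    return [
        alarm
        for alarm in alarms
        if classify_alarm(alarm, scheduled_work, power_outage_areas)
        == "unexplained"
    ]


alarms = [Alarm("dslam-a", "north"), Alarm("dslam-b", "south"), Alarm("dslam-c", "east")]
kept = remaining_alarms(alarms, scheduled_work={"dslam-a"}, power_outage_areas={"south"})
```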
  • the alarms which were enriched and filtered are then stored in step S7 in storage module 131 and in step S8 the information about the alarms can be displayed in a display or presentation module 126.
  • step S9 the information is provided to module 127 where module 127 tries to locate the reason of the alarm such as damaged or stolen cables to which the different ports are connected.
  • Module 127 can select ports having an operational status parameter down and can initiate measurements which can help to locate the error.
  • commands for measuring Time Domain Reflectometry (TDR) can be generated and transmitted to the access network 200 where in step S10 TDR measurements are carried out on the selected ports having the operational status parameter down.
  • TDR Time Domain Reflectometry
  • These measurements can include the single ended loop test measurements.
  • the measurement results can also be stored in a storage module in step S11 and may be transmitted to the analysis module 128 in step S12, which can then use the data from the storage module for performance analysis of ports, cards, or the whole network element.
  • the statistical model is determined based on the status parameters of the ports.
  • the operational status parameters of the ports of a network element are read and a status indicator OSR is determined which describes the evolution in time of the functional pattern of the network element to which the ports belong.
  • the OSR parameter is the ratio between the number of ports that have the operational status down and the number of ports taken in administration, i.e. the ports configured for data service delivery.
  • Fig. 4 shows an example of a time evolution of the OSR value.
  • OSR has values in the range from 0 to 1 and is a time series of discrete points, as the data are collected at regular time intervals. In the example shown every 15 minutes OSR is determined.
  • a daily periodicity of the graph exists which is mainly due to the behavior of the customers from a particular geographic area, namely the customers that are connected to the network element/DSLAM for which the data were collected.
  • the graph 40 in fig. 4 has a certain periodicity but isolated points such as points 41 or 42 exist which could be interpreted as anomalies.
  • $x_1, x_2, \ldots, x_n$ are the calculated OSR values for n time intervals.
  • the OSR values may be stored in a buffer of length n and they are continually recomputed as new data becomes available by progressively dropping the oldest value and by adding the latest value.
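  • A buffer of length n that drops the oldest value when a new one arrives can be sketched with a bounded deque. This is only one possible realization; the class name `OsrBuffer` is hypothetical.

```python
from collections import deque
from typing import Deque, List


class OsrBuffer:
    """Fixed-length buffer holding the n most recent OSR values."""

    def __init__(self, n: int) -> None:
        if n < 1:
            raise ValueError("buffer length must be at least 1")
        # A deque with maxlen discards the oldest entry automatically.
        self._values: Deque[float] = deque(maxlen=n)

    def add(self, osr: float) -> None:
        if not 0.0 <= osr <= 1.0:
            raise ValueError("OSR must lie between 0 and 1")
        self._values.append(osr)

    def values(self) -> List[float]:
        return list(self._values)


buffer = OsrBuffer(3)
for value in (0.1, 0.2, 0.3, 0.4):
    buffer.add(value)
```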
  • $Zscore_k = \frac{x_k - \bar{x}_n}{\sigma_n}$ (5) wherein the Z score calculated for sample $x_k$ is the Z score for the OSR value at the k-th time interval, and $\bar{x}_n$ and $\sigma_n$ are the mean and the standard deviation of the n OSR values. The Z score allows taking any given sample within a set of data and determining how many standard deviations above or below the mean the sample is.
  • $\Delta Zscore_k = Zscore_k - Zscore_{k-1}$ (6) with $\Delta Zscore$ being the difference or delta between two successive values of the Z score. For the different time intervals, different $\Delta Zscore$ values can be calculated by
  • $\Delta Zscore = \{\Delta Zscore_1, \Delta Zscore_2, \ldots, \Delta Zscore_{n-1}\}$ (7)
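  • The Z score and the difference between successive Z scores can be illustrated with the short Python sketch below. It assumes the population standard deviation is used, which the text does not specify, and returns zeros when all buffered values are identical to avoid a division by zero. The function names are hypothetical.

```python
import statistics
from typing import List, Sequence


def z_scores(values: Sequence[float]) -> List[float]:
    """Z score of each buffered OSR value: (x_k - mean) / standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # assumption: population standard deviation
    if std == 0:
        return [0.0] * len(values)
    return [(x - mean) / std for x in values]


def delta_z_scores(z: Sequence[float]) -> List[float]:
    """Difference between each Z score and its predecessor (n - 1 values)."""
    return [z[k] - z[k - 1] for k in range(1, len(z))]


osr_values = [0.1, 0.1, 0.1, 0.5]
z = z_scores(osr_values)
delta = delta_z_scores(z)
```

  • In this example the last OSR value jumps upward, so the last delta is positive while the earlier deltas are zero.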
  • a window function W is used which takes the discrete values inside a chosen interval and which sets all the values outside the window to zero, as shown by the following equation: $W(n) = 1$ for $1 \le n \le L$, and $W(n) = 0$ otherwise.
  • W will be used as a sliding window over the different $\Delta Zscore$ values of equation (7), and for the subset of values inside the window the average and standard deviation are calculated. Then the subset is modified by "shifting forward", i.e. by excluding the first number of the series and including the next number following the original subset in the series. This creates a new subset of numbers for which the average and standard deviation are calculated. This process is repeated over the entire time series by
  • the window moves from data point to data point so that consecutive windows overlap.
  • Equation (9) is an operation of convolution and the statistical parameters calculated by the sliding window method as discussed above are called moving average and moving standard deviation, denoted $m\bar{x}_k$ and $m\sigma_k$. Based on these parameters an upper threshold and a lower threshold are calculated as given by equations (10) and (11) below.
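  • The sliding window computation can be sketched in Python as shown below. The moving average and moving standard deviation follow the description above. The threshold formula is an assumption made for illustration only, since equations (10) and (11) are not reproduced here: the sketch uses the moving average plus or minus a hypothetical factor `c` times the moving standard deviation.

```python
import statistics
from typing import List, Sequence, Tuple


def moving_stats(series: Sequence[float], window: int) -> List[Tuple[float, float]]:
    """Moving average and moving standard deviation over overlapping windows."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    stats = []
    for end in range(window, len(series) + 1):
        subset = series[end - window:end]  # window shifted by one point each step
        stats.append((statistics.mean(subset), statistics.pstdev(subset)))
    return stats


def thresholds(
    series: Sequence[float], window: int, c: float = 3.0
) -> List[Tuple[float, float]]:
    """Assumed form: (average + c * std, average - c * std) per window."""
    return [(m + c * s, m - c * s) for m, s in moving_stats(series, window)]


series = [0.0, 0.0, 3.0, 0.0]
stats = moving_stats(series, window=2)
bounds = thresholds(series, window=2, c=2.0)
```

  • Because the statistics are recomputed for every window position, the resulting thresholds change over time with the recent behavior of the series.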
  • fig. 4 shows the aggregated indicator OSR as graph 40. Furthermore, the anomalies 41, 42 can be detected. As mentioned above, the status indicator OSR has a periodicity which depends on the behavior of customers connected to the network element for which OSR was calculated.
  • Fig. 5 now shows a graph 50 representing the Z score as calculated by equation (5) above. It can be deduced from fig. 5 that most of the points of the graph have values between $-2\sigma_n$ and $+2\sigma_n$. However, isolated points 51 or 52 are present in the graph which could be anomalies.
  • Fig. 6 shows the evolution over time of $\Delta Zscore$, determined by equation (6), in graph 60. As $\Delta Zscore$ is a difference, most of the points have a value close to 0. However, the graph 60 furthermore has outliers, such as 61, 62 and 63, 64.
  • The points above the time axis of fig. 6, such as points 61 and 63 for a specific moment in time, are associated with the phenomenon of desynchronization of a number of DSL ports compared to the previous time interval.
  • Given how OSR, Zscore and $\Delta Zscore$ are determined, the values above 0 in fig. 6 indicate that a greater number of ports has a status parameter down compared to a previous time interval.
  • the points with negative values, such as points 62 and 64 for a specific moment in time, are associated with the phenomenon of synchronization of a number of ports relative to a previous time interval.
  • Points such as points 62 and 64 of fig. 6 indicate that a lower number of ports has the operational status parameter down compared to the previous time interval. This can mean that a problem has been solved for a number of ports, so that these points could trigger the clearing of alarms.
  • Fig. 7 shows the thresholds, i.e. the upper border and the lower border as calculated by equations (10) and (11), together with the graph 60 indicating $\Delta Zscore$. If an isolated point such as point 61 is greater than the upper threshold, the alarm is triggered. In a similar way, if an outlier is lower than the lower threshold, the alarm can be cleared. In the example given an alarm is triggered at time 15:00:48 and canceled on the same day at 16:00:43, as the two points or outliers are located outside the thresholds. The points that are located within the upper and the lower threshold are considered part of the usual functional pattern of the DSLAM. As can be seen, the thresholds 71 and 72 change over time with the behavior patterns of the network element.
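  • The triggering and clearing logic can be illustrated with the following Python sketch. It takes the series of Z score differences together with one upper and one lower threshold per point. Tracking whether an alarm is currently active, so that an alarm is only cleared after it was triggered, is an assumption of this sketch; the function name `alarm_events` is hypothetical.

```python
from typing import List, Sequence, Tuple


def alarm_events(
    delta_z: Sequence[float], upper: Sequence[float], lower: Sequence[float]
) -> List[Tuple[int, str]]:
    """Return (index, "trigger" | "clear") events for outliers in the series."""
    events: List[Tuple[int, str]] = []
    active = False  # assumption: at most one open alarm per network element
    for k, (value, up, low) in enumerate(zip(delta_z, upper, lower)):
        if value > up and not active:
            active = True
            events.append((k, "trigger"))
        elif value < low and active:
            active = False
            events.append((k, "clear"))
    return events


# Constant thresholds are used here only to keep the example short.
events = alarm_events(
    delta_z=[0.1, 2.5, 0.0, -2.2, 0.1],
    upper=[1.0] * 5,
    lower=[-1.0] * 5,
)
```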
  • the thresholds depend on time and are specific for each network element, as they are generated by the self-adaptive statistical model as discussed above by taking into account previously collected OSR values in the window.
  • the triggered alarms such as alarms triggered by point 61 or 63, can be determined in the triggering or clearing alarm module 124 discussed above in connection with fig. 3.
  • the generation of the self-adaptive statistical model may be carried out in module 123 discussed above in connection with fig. 3.
  • Fig. 9 shows a schematic view of the network element comprising different ports 215.
  • the port group refers to the passive access network structure, wherein the copper cables used to connect the ports to the wide area network are divided into different groups and each group has several pairs of wires, such as hundred pairs of wires.
  • a pair of wire can be connected to a single port 215 and the port connected in this way is associated with the corresponding group 218 to which the corresponding wire pair is connected.
  • a port group may comprise ports of a single network element, such as groups 218a to 218c, 218e and 218f or 218g, but a group can also be distributed across multiple network elements, as is the case for group 218d.
  • the OSR value is determined for a single network element 210.
  • Fig. 9 furthermore shows the operational status parameter of a port.
  • the fully shaded circle indicates a port having a status parameter of down, such as port 215a, whereas the non-shaded circles such as port 215b have an operational status parameter of up.
  • the different port groups may have a different number of ports with a status parameter of down.
  • port group 218e comprises in the embodiment shown six ports which all have the status parameter down, whereas in port group 218b one port has the status parameter up and all other ports have the status parameter down. Based on the number of ports in each group having a status parameter down, different fault categories can be determined for the network element 210.
  • a first fault category may be assigned to the situation.
  • severe faults such as a missing cable can be excluded as in each port group at least one port has a status parameter other than down.
  • although this category, in which all port groups have at least one port with status up, is a more or less fault-free situation, it is named the first fault category.
  • a more severe fault category is the second fault category where at least one group from the port groups exists which does not have any port with a status up. This is the case for the upper network element 210 shown in fig. 9, as port group 218e has all ports with status parameter down.
  • a third more severe category would be that no group exists with a port having a status parameter up.
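  • The three fault categories can be expressed compactly in Python. The sketch assumes the port groups of one network element are given as a mapping from a group identifier to the statuses of its ports; the function name `fault_category` and the example data are hypothetical.

```python
from typing import Dict, List


def fault_category(groups: Dict[str, List[str]]) -> int:
    """Return 1, 2 or 3 according to how many port groups have no port up."""
    if not groups:
        raise ValueError("at least one port group is required")
    groups_without_up = [
        name for name, statuses in groups.items() if "up" not in statuses
    ]
    if not groups_without_up:
        return 1  # every group has at least one port with status up
    if len(groups_without_up) < len(groups):
        return 2  # at least one group has no port with status up
    return 3      # no group has a port with status up


example = {
    "218e": ["down"] * 6,
    "218f": ["up", "down", "up"],
}
category = fault_category(example)
```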
  • One port group 218e has all ports with status parameter down. However, it cannot be said with certainty that a cable fault or a missing cable is the reason for the detected behaviour.
  • SELT measurements are launched for at least two ports of this group 218e. The SELT measurement for a single port can take up to 4 minutes and is carried out by a SELT chipset on a DSLAM card. If the measurements yield the same length, it can be deduced that the cable of the group 218e is cut, stolen or damaged.
  • a cable fault or missing cable can occur for groups with all the ports having the status parameter down.
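  • The comparison of SELT length measurements can be sketched as follows. The text only states that equal lengths point to a cut, stolen or damaged cable; treating lengths as equal when they differ by no more than a tolerance, and the tolerance value itself, are assumptions of this sketch. The function name is hypothetical.

```python
from typing import Sequence


def cable_fault_suspected(lengths_m: Sequence[float], tolerance_m: float = 5.0) -> bool:
    """True if all measured loop lengths agree within the given tolerance."""
    if len(lengths_m) < 2:
        raise ValueError("measurements for at least two ports are required")
    return max(lengths_m) - min(lengths_m) <= tolerance_m


# Two ports of the same group report nearly the same loop length.
same_length = cable_fault_suspected([412.0, 414.5])
different_length = cable_fault_suspected([412.0, 980.0])
```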
  • Fig. 1 summarizes the steps carried out by fault detector 100 to detect a fault at a network element such as network element 210.
  • the method starts in step S10 and in step S11 the status parameters are determined for the plurality of ports, such as all ports of one network element. Furthermore, the evolution in time of the status parameters is monitored. Based on the monitored status parameters the status indicator OSR, the out of sync ratio, is determined in step S12. As indicated by equations (1) and (2) above, the OSR values are determined for different time intervals resulting in the graph of fig. 4. Based on the status parameter a self-adaptive statistical model with a time dependent upper threshold and a time dependent lower threshold is determined in step S13, wherein the model includes the $\Delta Zscore$ as shown in figs.
  • step S15 a fault can then be detected based on the outlier.
  • the outlier may trigger an alarm and the alarm is enriched with information to see whether known reasons exist for the alarm.
  • a filtering can be carried out, such as the filtering in order to determine the group behavior of the group to which the different ports belong. Based on the filtering different fault categories may be determined as discussed above. The method ends in step S16.
  • Fig. 8 shows a schematic view of a fault detector 100, the fault detector comprising an input/output unit 110 comprising a transmitter 111 and a receiver 112.
  • the input/output unit 110 represents the possibility of the fault detector to transmit control messages or user data to other entities, the receiver representing the possibility to receive control messages or user data from other entities.
  • the input/output unit 110 may be used inter alia to receive the different status parameters of the ports which are then used for the determination of the self-adaptive statistical model.
  • a processing unit 160 is provided comprising one or more processors and which is responsible for the operation of the fault detector as discussed above.
  • the processing unit 160 can generate the commands that are needed to carry out the procedures of the fault detector discussed above or further below in which the fault detector is involved.
  • a memory 130 can be provided which can store a suitable program code to be executed by the processing unit 160.
  • a display 140 can display the information such as the different fault categories or the status of the different ports, wherein a human-machine interface HMI 150 may be provided for the interaction between a user and the fault detector.
  • the different modules 121 to 128 shown in fig. 3 may be partly provided in processing unit 160 as processing modules, e.g. the task scheduler 121 or may be provided as software modules in memory 130 such as the module 123 for the determination of the statistical model. Furthermore, the different modules shown in fig.
  • Memory 130 can comprise different program modules which, when executed by the processing unit 160, cause the processing unit 160 to execute the corresponding method steps discussed above or in further detail below.
  • the fault detector includes one or more processing units and a memory 130 coupled to the processing unit.
  • the memory can include a read-only memory, a random access memory, a dynamic RAM or a static RAM, a mass storage or the like.
  • the memory can include various program code modules for causing the fault detector to perform operations as discussed above in connection with figs. 1 to 3 and as discussed in connection with figs. 4 to 7 from the determination of the self-adaptive statistical model.
  • the fault detector can comprise a processor and a memory, wherein the memory contains instructions executable by the processor, wherein the apparatus is operative to carry out the steps mentioned in connection with figs. 1 to 3 or mentioned in the context with figs.
  • a fault detector may be provided comprising different modules configured to perform the steps discussed above in connection with figs. 1 to 3 or discussed above in connection with figs. 4 to 7 for the generation of the self-adaptive statistical model.
  • the self-adaptive statistical model can comprise a time dependent model parameter ΔZ with a time dependent upper threshold and a time dependent lower threshold, wherein the fault is detected when the model parameter is outside the upper or lower threshold.
  • With the time dependent upper and lower thresholds, the fault detector can react better than with fixed thresholds to the behavior of the customers, which influences the status parameters of the ports as shown in fig. 3. As shown especially in connection with fig. 7, the adaptive thresholds can better reflect whether an anomaly is present or not.
  • the alarm signal may be triggered or activated when the model parameter ΔZ is outside one of the upper or lower thresholds, wherein the alarm signal may be deactivated again after being activated when the model parameter is outside the other of the upper or lower thresholds.
  • the alarm was triggered when the model parameter was outside the upper threshold and was deactivated again when the model parameter was outside the lower threshold.
  • the time dependent model parameter ΔZ with the upper and lower threshold may be determined for different consecutive time intervals and the determination of the self-adaptive statistical model comprises determining Z scores of the status indicator and comparing the Z scores of consecutive time intervals in order to determine the time dependent model parameter ΔZ.
  • the determination of the statistical model is based on the determination of a ratio between the number of ports of the network element having an operational status parameter down and the number of ports of the network element having an operational status parameter configured for delivery of the data packets or in other words the network elements taken in administration.
  • the different ports can be grouped into different port groups, such as port groups 218a to 218g shown in fig. 9. Furthermore, a fault category can be determined for a network element based on the operation status parameters of the ports in the different port groups of the network element.
  • Different fault categories can be determined describing how severe a fault is, based on the fact how many ports exist in each of the port groups which have the operational status parameter configured for delivery of data packets.
  • the determination of the fault category can inter alia comprise the following steps. It may be determined whether at least one port group exists in which at least one port has a status parameter other than down. Furthermore, it may be determined whether in each of the port groups at least one port exists which has a status parameter other than down. If a port group exists in which one port has the status parameter up, it can be excluded that the port group to which this port belongs is a port group for which a stolen or missing cable can be determined.
  • a first fault category may be determined when each of the port groups of the network element has at least one port with the operational status parameter down.
  • a more severe fault category is determined when at least one port group of the network element exists in which all ports of the at least one port group have the operational status parameter down.
  • a still more severe fault category is determined when in all of the port groups of the network element all ports in the corresponding port groups have the operational status parameter down.
  • a cable connected to a defined number of ports and a cable fault or missing cable can be excluded when at least one port exists from the defined number of ports which has a status parameter other than down.
  • if a cable fault cannot be excluded, it is possible to determine a location of a fault such as a cable fault or missing cable based on test measurements carried out through at least two of the defined number of ports which have an operational status parameter down.
  • information from the access network may be used to filter out known faults for which the cause is already known when an alarm is generated for the network element after having filtered out the known faults. This filtering helps to avoid that an alarm is generated for a behavior of the port status for which an explanation is already known.
  • the self-adaptive statistical model with the time dependent upper threshold and the time dependent lower threshold may be determined for each time interval based on values of the status indicator which are in the corresponding time interval.
  • the upper and the lower threshold are determined for each time interval based on historical status parameter values in accordance with the self-adaptive statistical model.
  • the solution provides a possibility to determine a fault of a network element based on known characteristics of a network element such as the port status.
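
The fault categories described in the items above can be illustrated with a minimal Python sketch. The function names, the dictionary layout and the category labels LOW, MEDIUM and HIGH are assumptions made for illustration only; the text speaks of a first, a more severe and a still more severe category, and the mapping to three named levels is hypothetical.

```python
from enum import IntEnum


class FaultCategory(IntEnum):
    # Labels are hypothetical; the source only orders the categories by severity.
    NONE = 0
    LOW = 1     # each port group has at least one port down
    MEDIUM = 2  # at least one port group has all ports down
    HIGH = 3    # all ports of all port groups are down


def fault_category(port_groups):
    """Classify a network element from its port groups.

    port_groups maps a group id to a list of port statuses ("up" / "down").
    """
    groups = [statuses for statuses in port_groups.values() if statuses]
    if not groups:
        return FaultCategory.NONE
    fully_down = [all(s == "down" for s in g) for g in groups]
    partly_down = [any(s == "down" for s in g) for g in groups]
    if all(fully_down):
        return FaultCategory.HIGH
    if any(fully_down):
        return FaultCategory.MEDIUM
    if all(partly_down):
        return FaultCategory.LOW
    return FaultCategory.NONE


def cable_fault_candidates(port_groups):
    """Return the groups for which a cable fault cannot be excluded,
    i.e. the groups in which no port has a status other than down."""
    return [gid for gid, statuses in port_groups.items()
            if statuses and all(s == "down" for s in statuses)]
```

A group that contains a single port with a status other than down is never returned by `cable_fault_candidates`, which mirrors the exclusion rule stated above.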

Abstract

The invention relates to a method for detecting a fault at a network element (210) located in a data packet access network. Status parameters of a plurality of ports (215) of the network element (210) and an evolution in time of the status parameters are determined. Additionally, a status indicator (OSR) and a temporal evolution of the status indicator are determined based on the determined status parameters, the status indicator representing information of aggregated status parameters of several ports (215) of the network element. A self-adaptive statistical model is determined based on the determined status indicator (OSR). Furthermore, an outlier is determined in the self-adaptive statistical model, and the fault at the network element (210) is detected based on the determined outlier.

Description

Anomaly detection in a data packet access network
Technical Field
The present invention relates to a method for detecting a fault in a network element located in a data packet access network and relates to the corresponding fault detector. Furthermore, a computer program comprising program code and a carrier comprising the computer program are provided.
Background
Traditional solutions for detecting faults or anomalies in access networks of a telecommunications network are based on hardware probes, such as Remote Test Units, RTUs, that allow electrical measurements such as measurements of the capacity, inductance or resistance. Alarms are generated by processing the results of the measurements. If a fault at the network element in the access network should be detected without RTUs, it is known to use artificial intelligence algorithms in order to obtain an increased diagnostic accuracy. Faults in a network element of the access network may be detected without hardware probes using xDSL access network characteristics.
WO 2013/144539 A discloses a solution which reads the xDSL status ports from an Element Manager System, EMS. Furthermore, it is possible to detect a fault using xDSL parameters such as the number of erroneous seconds, wherein the type of fault is identified using statistical algorithms such as regression.
The main problem occurring for solutions involving the use of RTUs lies in the high costs that arise for large access networks in which several tens of thousands of cable groups or at least 1,000,000 users may be located. These solutions are therefore used only in a limited area of the access network.
In the case of solutions without RTUs the following problems can occur:
The alarm is raised based on a parameter associated with a port collected by an Element Manager System (EMS) or by a RADIUS protocol which in the case of large access networks can lead to the overload of EMS or RADIUS. Furthermore, the solutions do not take into account the behavior pattern of a single network element, such as a DSLAM (Digital Subscriber Line Access Multiplexer). Furthermore, when a cable which is connected to the network element in the access network is damaged, cut or stolen, a measurement is initiated manually and the measurement does not locate the problem.
Summary
Accordingly, a need exists to avoid at least some of the above-mentioned drawbacks and to provide a possibility for an improved fault detection in a network element located in a data packet access network.
This need is met by the features of the independent claims. Further aspects are described in the dependent claims.
According to a first aspect, a method for detecting a fault in a network element is provided, the network element being located in a data access network. According to one step, status parameters of a plurality of ports of the network element are determined and an evolution in time of the status parameters is determined. Additionally, a status indicator OSR and a temporal evolution of the status indicator are determined based on the determined status parameters, wherein the status indicator represents information of aggregated status parameters of several ports of the network element. Furthermore, a self-adaptive statistical model is determined based on the determined status indicator OSR, an outlier in the self-adaptive statistical model is determined, and a fault in the network element is detected based on the determined outlier. The proposed solution is based on existing resources in the data packet access network and there is no need for additional equipment such as measuring probes or RTUs. The status parameters of the ports of the network element are used as the main data input for the decision. The self-adaptive statistical model, which self-adapts based on stored status parameter values, furthermore provides a better detection of a fault, as no predefined and constant model is used. Furthermore, there is no need for an operator to adapt the model manually. Furthermore, the status parameters are read directly from the network element and not via EMS, so no overload of EMS or of the RADIUS protocol occurs.
Additionally, the corresponding fault detector is provided, the fault detector comprising at least one processing unit configured to inter alia carry out the steps mentioned above. Furthermore, a computer program is provided comprising program code to be executed by at least one processing unit, wherein execution of the program code causes the at least one processing unit to execute the method mentioned above. Furthermore, a carrier comprising the computer program is provided.
Features mentioned above and features yet to be explained below may not only be used in isolation or in combination as explicitly indicated, but also in other combinations. Features and embodiments of the present application may be combined unless explicitly mentioned otherwise.
Brief Description of the Drawings
Various features of embodiments of the present invention will become apparent when read in conjunction with the accompanying drawings.
Fig. 1 is a flowchart of a method carried out by a fault detector in order to detect a fault at a network element located in a data packet access network.
Fig. 2 is a flowchart of a method carried out by a fault detector for detecting a fault in a network element according to a further embodiment.
Fig. 3 illustrates an architecture of a system in which a fault detector detects a fault of a network element located in an IP access network.
Fig. 4 indicates a time series of a status indicator determined based on status parameters of the ports of the network element located in the access network.
Fig. 5 shows a time series of a z score determined from the data shown in Fig. 4.
Fig. 6 shows a time series of a model parameter ΔZ of a self-adaptive statistical model used to determine a fault of the network element.
Fig. 7 shows a time series of a statistical model used to detect the fault, the statistical model providing an upper threshold and a lower threshold used to determine if a fault at the network element is detected and an alarm is generated.
Fig. 8 shows a schematic architectural view of a fault detector used to detect fault at the network element.
Fig. 9 shows a schematic view of how the ports of the network element are grouped into different groups, each port having a certain port status.
Fig. 10 shows a schematic architectural overview of a system including the network element such as a DSLAM and a fault detector.
Detailed Description
In the following, embodiments of the invention will be described in detail with reference to the accompanying drawings. It is to be understood that the following description of embodiments is not to be taken in a limiting sense. The scope of the invention is not intended to be limited by the embodiments described hereinafter, wherein the drawings are to be taken demonstratively only.
The drawings are to be regarded as being schematic representations and elements illustrated in the drawings are not necessarily shown to scale. Rather, the various elements are represented such that their function and general purpose becomes apparent for a person skilled in the art. Any connection or coupling between the functional blocks, devices, components or other physical or functional units shown in the drawings or described herein may also be implemented by an indirect connection or coupling. A coupling between components may be established over a wired or wireless connection. Functional blocks may be implemented in hardware, firmware, software or a combination thereof.
Hereinafter techniques for detecting a fault in a network element located in a data packet access network are described in further detail, especially a technique which is based on existing resources, such as status parameters of ports of the network element. The fault is detected in a network element such as a DSLAM (Digital Subscriber Line Access Multiplexer) which connects multiple Digital Subscriber Line (DSL) interfaces to a digital communication channel using multiplexing. The proposed solution is based on existing resources in the access network and does not require investment in additional equipment, such as measuring probes. As a main input the present invention uses the status of the ports of the network element, wherein the ports are used to connect the different DSL interfaces to the communication channel as mentioned above. The port status is used as a source of data for detecting anomalies at the network element/DSLAM. The method starts from a characteristic of the access network, such as the xDSL access network, i.e. a port status, without using hardware test probes. In this context it has to be mentioned that the port status is read directly from the network element without using other elements in between from which the port status may be obtained, such as elements of management (EMS, Element Management System) or platform authentication using a RADIUS protocol. This approach avoids the overload problem of EMS and RADIUS and also identifies the status of the network elements automatically from the beginning. Alarms resulting from a detected fault are generated and cleared automatically using a self-adaptive statistical model based on a pattern of the behavior over time of the status of the ports. The model is adaptive with a dynamic alarm threshold that varies over time, and it is specific to each network element, i.e. it differs from one network element to another.
The method can furthermore take into account information from external resources, such as information about the network inventory in which the network element is located or information about known network interruptions due to maintenance work etc.
Furthermore, the different ports and port status parameters may be grouped into different groups, as different ports of the network element are connected to cables, such as copper cables, wherein the copper cables are divided into groups and each group has several pairs of wire, wherein a pair of wire can be connected to one port, so that the port is thus connected or associated with a group. The faults at the network element can then be categorized taking into account a status parameter of all ports belonging to the same group.
Fig. 10 shows an architectural overview of a telecommunication system in which a network element such as a DSLAM 210 is provided to collect the data from different CPEs (customer premises equipment) 310. The aggregated traffic is then directed via a router or switch 500 to a backbone switch of the telecommunications system and to the internet. Depending on the role of the different CPEs 310 the data handled by DSLAM 210 can include voice data and other data packets such as internet data. As explained in further detail in connection with Fig. 9 each DSLAM comprises a plurality of ports 215 used for transmission, reception and forwarding of data packets.
A fault detector 100 is provided which will detect faults and errors of the ports 215 and can generate alarms for the network element 210. The fault detector 100 may be part of the DSLAM or a separate unit connected to the DSLAM, and may specifically be a fault detector 100 as described below in connection with Fig. 8. Fig. 3 provides a more detailed overview in which the fault detector 100 can detect faults and errors of the network element 210 provided in data packet access network 200. The access network is connected to data sources 300, wherein the data source can comprise customer premises equipment (CPE) 310 of Fig. 10 and DSL terminations. The data received from the network sources or requested from the network sources can be transmitted over the access network 200 to a wide area network such as the Internet not shown in Fig. 3.
As will be explained in further detail below, the fault detector may use information from external services 400. Referring also to fig. 2, a high level summary of the steps carried out by the fault detector 100 is given. In a first step S21 the status parameters of the ports are determined. The link status of a port may be down, meaning that the link is down, or may be up, meaning that the link is enabled and ready to send data packets. Furthermore, other status parameters, such as dormant or not connected, are known. In the present application the status parameters of all administered ports, i.e. the ports which are configured for data service delivery, are taken into account and a port status up or port status down will be used to determine a fault at the network element. As will be explained in further detail below, the status parameter and its behavior over time will form the basis for a self-adaptive statistical model that is generated in step S22 using the time evolution of the status parameters of the ports. The model is specific to each network element and can be used as a system that accepts as input source an aggregated indicator, a status indicator of the status parameters of the ports for the monitored DSLAM. If certain conditions based on the statistical model are met, an alarm is triggered or an alarm is cleared in step S23. In step S24 the alarms are enriched with information from the external data 400 shown in fig. 3, such as the network inventory, and then the alarm is filtered in order to filter out alarms for which the reason is already known, such as an interruption in the access network for administration or maintenance work.
Different reasons exist that can lead to a change of the status parameters of the port. The CPE to which the data packets are transmitted or from where the data packets are received may have been turned down due to the shutdown of the electrical network, e.g. due to work scheduled in the electrical network by the network supplier or by faults in the electrical network. Furthermore, the client operating the CPE may have simply turned off the power of the CPE. A change of a status parameter may also originate from faults of some equipment cards at the DSLAM site, be it a hardware or software fault. Furthermore, it is possible that the copper cables connected to the ports may be cut or may even be stolen. A further reason for a change of the operating parameters may be work scheduled for replacement or the repair of the copper cables. Furthermore, operating parameters of the copper cable may degrade resulting in an increased instability of the customer lines. A further reason is the suspension of the service for the client due to non-payment of services.
For all the reasons above the status parameter and its change may occur unexpectedly and can be considered as different compared to the usual behavior of a status parameter. When an alarm is detected in step S24 for which the reason cannot be explained, step S25 may be carried out in which a damaged, a cut or a stolen cable is located using test measurements such as SELT (Single Ended Loop Testing) measurements which may be performed by a chipset on a DSLAM card. If the result of the measurement is that all cables for which an operating parameter down has been detected have the same length, the cable may be cut, stolen or damaged. If different length measurements result from the SELT measurement, it can be excluded that the cable is the reason for ports of the same group having the status parameter down. Furthermore, in step S26 an analysis of the status parameters is carried out where the access network instability and the location of the points of instability are analyzed. Using information from a storage module 131 of Fig. 3, reports are generated such as:
- how often a port status changes in a given period of time;
- which cards / DSLAMs have the largest number of ports that change state in a certain period of time.
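
The same-length check of step S25 can be sketched as follows. This is only an illustration under stated assumptions: the function name, the unit (metres) and in particular the tolerance value are hypothetical, since the text does not state how closely two SELT lengths must agree to count as the same length.

```python
def cable_fault_suspected(selt_lengths_m, tolerance_m=10.0):
    """Return True when the measured loop lengths agree.

    selt_lengths_m: loop lengths measured by SELT on ports of one port
    group that have the status parameter down. At least two measurements
    are needed, as stated in the text.
    tolerance_m: assumed agreement tolerance (hypothetical value).
    """
    if len(selt_lengths_m) < 2:
        return False  # a single measurement cannot be compared
    return max(selt_lengths_m) - min(selt_lengths_m) <= tolerance_m
```

If the lengths differ by more than the tolerance, the cable can be excluded as the common reason for the ports of the group being down.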
Fig. 3 provides a more detailed description of the method carried out by fault detector 100 in order to detect a fault at network element 210. In step S1 a task scheduler 121 can periodically launch a data gathering module 122 at regular intervals, such as every 15 minutes or any other interval.
In step S2 the data gathering module reads the status parameters of the ports 215 shown in fig. 9. The status parameters may be read using multi-threaded SNMP (Simple Network Management Protocol) queries launched through the IP network to the access network. By way of example, in 15 minutes it is possible to read the status of more than one million ports and to go through all the steps S2 to S12 described below. As a consequence the processing can be regarded as near real-time.
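
A multi-threaded read of the port status, as mentioned for step S2, might be organised as in the following sketch. The callable `read_port_status` is a hypothetical stand-in for the actual SNMP query, which is not specified here; the worker count is likewise an assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def poll_ports(read_port_status, port_ids, max_workers=32):
    """Read the status of many ports concurrently.

    read_port_status: hypothetical callable taking a port id and returning
    its status (e.g. "up" or "down"); in practice it would wrap an SNMP query.
    Returns a dict mapping each port id to its status.
    """
    port_ids = list(port_ids)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns the results in the order of port_ids
        statuses = list(pool.map(read_port_status, port_ids))
    return dict(zip(port_ids, statuses))
```

Threads fit this task because each query spends most of its time waiting for the network rather than computing.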
The result of the reading of the status parameters of the DSLAM is then returned through SNMP to the data gathering module 122 where a status indicator OSR is calculated. The status indicator OSR is an Out of Sync Ratio determined for the ports in administration of the network element. OSR is the ratio between the number of ports that have the operational status parameter down/out of sync and the number of ports having an operational status parameter configured for delivery of data packets. OSR has values in the range between 0 and 1, wherein the value is 0 if all ports are synchronized or have the status parameter up and the value is 1 if all ports have the status parameter down or are desynchronized. In step S3 this status indicator OSR and its evolution over time are forwarded to statistical module 123 where a self-adaptive statistical model is generated. Based on the OSR values module 123 generates the self-adaptive statistical model which describes a functional behavior pattern for the network element. This pattern is unique to each network element and changes over time.
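
The OSR calculation described above amounts to a simple ratio. The sketch below assumes that the input contains only the administered ports, i.e. the ports configured for data service delivery; the function name and the string values "up" and "down" are chosen for illustration.

```python
def out_of_sync_ratio(port_status):
    """Compute the Out of Sync Ratio (OSR) of one network element.

    port_status maps a port id to "up" or "down" and is assumed to
    contain only the administered ports.
    """
    administered = len(port_status)
    if administered == 0:
        raise ValueError("no administered ports")
    down = sum(1 for status in port_status.values() if status == "down")
    return down / administered
```

For example, a network element with four administered ports of which two are down has an OSR of 0.5.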
In step S4 the parameters of the statistical model are sent to an alarm module 124 which is a module configured to trigger and clear alarms. As will be described in further detail below, the statistical model has time dependent thresholds, a time dependent upper threshold and a time dependent lower threshold. These thresholds are determined for different time intervals, and a further parameter derived from the status indicator OSR is checked against them to determine whether it is above the upper threshold or below the lower threshold, i.e. outside the thresholds.
At each time interval these OSR derived parameters are compared to the thresholds and, if necessary, alarms are started or cancelled. The starting and the cancelling of the alarm and the determination of the OSR derived parameters are described in further detail below in connection with figs. 4 to 7.
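
The starting and cancelling of alarms at each time interval can be sketched as a small state machine. The sketch assumes one of the variants described in this document, namely that the alarm is started when the OSR derived parameter exceeds the upper threshold and is cancelled when it falls below the lower threshold; the function names are hypothetical.

```python
def update_alarm(active, delta_z, upper, lower):
    """Return the new alarm state for one time interval.

    Assumed variant: start above the upper threshold, cancel below the
    lower threshold, otherwise keep the previous state.
    """
    if not active and delta_z > upper:
        return True
    if active and delta_z < lower:
        return False
    return active


def alarm_states(samples):
    """samples: iterable of (delta_z, upper, lower), one per time interval."""
    active = False
    states = []
    for delta_z, upper, lower in samples:
        active = update_alarm(active, delta_z, upper, lower)
        states.append(active)
    return states
```

Because the alarm is cleared by a different threshold than the one that started it, a value that merely returns inside the thresholds does not cancel the alarm.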
In step S5 the generated alarms are sent to the enrichment filtering module 125. Here the generated alarms are enriched with information from external sources 400, such as the network inventory database 410 or the power alarm flows 420 or a database 430 informing about scheduled work performed in the network for different reasons, such as maintenance or repair. Module 125 classifies the alarms and checks whether the reason for the alarm can be explained with the information provided from the external resources 400. By way of example alarms generated by ports for reasons of scheduled work or lack of electric power in a specific area are classified as informative, but the causes of those alarms are treated in other monitoring systems and are not part of the present invention. In the present invention the fault detector is interested in alarms that remain after filtering out the alarms mentioned above for which the reason can be explained with other reasons and for which the reason is known. Furthermore, a categorization of the alarms can be carried out in the enrichment filtering module 125. As will be explained in connection with fig. 9 further below, depending on the number of ports having an operational status down compared to the number of ports of a group or all ports of a network element, the level of the alarm may be classified to a low level, a medium level or a high level alarm. The alarms which were enriched and filtered are then stored in step S7 in storage module 131 and in step S8 the information about the alarms can be displayed in a display or presentation module 126. Furthermore in step S9 the information is provided to module 127 where module 127 tries to locate the reason of the alarm such as damaged or stolen cables to which the different ports are connected. Module 127 can select ports having an operational status parameter down and can initiate measurements which can help to locate the error. 
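
The classification carried out by the enrichment filtering module 125 can be sketched as a lookup against known events. The record layout, the field names and the labels "informative" and "unexplained" are assumptions for illustration; the actual interfaces of the network inventory database 410, the power alarm flows 420 and the database 430 are not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KnownEvent:
    # Hypothetical record derived from external sources such as scheduled
    # work or power alarms; times are plain numbers (e.g. epoch seconds).
    element_id: str
    start: float
    end: float
    cause: str


def classify_alarm(element_id, raised_at, known_events):
    """Return ("informative", cause) if a known event explains the alarm,
    otherwise ("unexplained", None)."""
    for event in known_events:
        if event.element_id == element_id and event.start <= raised_at <= event.end:
            return ("informative", event.cause)
    return ("unexplained", None)
```

Only the alarms classified as unexplained would be passed on for cable fault localisation.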
To this end in step S10 commands for measuring Time Domain Reflectometry (TDR) can be generated and transmitted to the access network 200 where in step S10 TDR measurements are carried out on the selected ports having the operational status parameter down. These measurements can include the single ended loop test measurements. The measurement results can also be stored in a storage module in step S11 and may be transmitted to the analysis module 128 in step S12 which can then use the data from the storage module for performance analysis of ports, cards or the whole network element.
In the following it will be described in more detail how the statistical model is determined based on the status parameters of the ports. As discussed above in step S2, the operational status parameters of the ports of a network element are read and a status indicator OSR is determined which describes the evolution in time of the functional pattern of the network element to which the ports belong. The OSR parameter is the ratio between the number of ports that have the operational status down and the number of ports taken in administration, i.e. the ports configured for data service delivery.
$$OSR = \frac{\text{Number of ports with operating parameter down}}{\text{Number of ports configured for data delivery}} \quad (1)$$
Fig. 4 shows an example of a time evolution of the OSR value. As can be deduced from equation (1) and fig. 4, OSR has values in the range from 0 to 1 and is a time series of discrete points, as the data are collected at regular time intervals. In the example shown OSR is determined every 15 minutes. From fig. 4 it can be deduced that a daily periodicity of the graph exists which is mainly due to the behavior of the customers in a particular geographic area, namely the customers that are connected to the network element/DSLAM for which the data were collected. Thus, the graph 40 in fig. 4 has a certain periodicity, but isolated points such as points 41 or 42 exist which could be interpreted as anomalies. One question which can be solved with the present invention is the following: It has to be decided whether the isolated points in the graph are anomalies and it has to be determined how the isolated points can be separated from the other points, the majority of points. The invention is based on a self-adaptive statistical model. First of all, some notations are introduced for the following discussion:
Figure imgf000012_0001
where xi, X2,...xn are the OSR calculated values for n time intervals. The OSR values may be stored in a buffer of length n and they are continually recomputed as new data becomes available by progressively dropping the oldest value and by adding the latest value.
$$\bar{x}_n = \frac{1}{n} \sum_{i=1}^{n} x_i \qquad (3)$$
is the average of all OSR values calculated until the nth time interval.
$$\sigma_n = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x}_n)^2} \qquad (4)$$
is the standard deviation of all OSR values calculated until the nth time interval.
$$Zscore_k = \frac{x_k - \bar{x}_n}{\sigma_n} \qquad (5)$$
wherein the Z score calculated for sample $x_k$ is the Z score for the OSR value at the kth time interval. The Z score makes it possible to take any given sample within a set of data and to determine how many standard deviations above or below the mean the sample is.
$$\Delta Zscore_k = Zscore_k - Zscore_{k-1} \qquad (6)$$
with ΔZscore being the difference or delta between two successive values of the Z score. For the different time intervals, different ΔZscore values can be calculated, giving
$$\Delta Zscore = \{\Delta Zscore_1, \Delta Zscore_2, \ldots, \Delta Zscore_{n-1}\} \qquad (7)$$
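The notation introduced so far (buffer of OSR values, Z score, ΔZscore) can be sketched in Python as follows. This is an illustrative sketch, not the implementation of the fault detector: the function names are hypothetical, the buffer length of 5 is chosen only for the example, and the use of the population standard deviation (division by n) is an assumption.

```python
import math
from collections import deque
from typing import List


def z_scores(values: List[float]) -> List[float]:
    """Z score of every buffered OSR value against the buffer mean and deviation."""
    n = len(values)
    if n == 0:
        return []
    mean = sum(values) / n
    # Assumption: population standard deviation (division by n).
    sigma = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    if sigma < 1e-12:
        return [0.0] * n  # flat series: no sample deviates from the mean
    return [(x - mean) / sigma for x in values]


def delta_z_scores(values: List[float]) -> List[float]:
    """Difference between successive Z scores; one element shorter than the input."""
    z = z_scores(values)
    return [z[k] - z[k - 1] for k in range(1, len(z))]


# Buffer of length n = 5: appending a sixth value drops the oldest one.
osr_buffer = deque(maxlen=5)
for value in [0.30, 0.10, 0.12, 0.11, 0.45, 0.12]:
    osr_buffer.append(value)

dz = delta_z_scores(list(osr_buffer))
```

In this example the jump of the OSR to 0.45 produces a large positive ΔZscore, followed by a large negative one when the OSR falls back.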
Furthermore, a window function W is defined which takes the discrete values inside a chosen interval and which sets all the values outside the window to zero as shown by the following equation:
$$W(n) = \begin{cases} 1, & 1 \le n \le L \\ 0, & \text{elsewhere} \end{cases} \qquad (8)$$
W will be used as a sliding window over the different ΔZscore values of equation (7), and for the subset of values inside the window the average and standard deviation are calculated. Then the subset is modified by "shifting forward", i.e. by excluding the first number of the series and including the next number following the original subset in the series. This creates a new subset of numbers for which the average and standard deviation are calculated. This process is repeated over the entire time series by
$$\sum \Delta Zscore \ast W(n - 1) \qquad (9)$$
Thus, the window moves from data point to data point so that consecutive windows overlap.
Equation (9) is a convolution operation, and the statistical parameters calculated by the sliding window method discussed above are called the moving average and the moving standard deviation, denoted $m\bar{x}_k$ and $m\sigma_k$. Based on these parameters an upper threshold and a lower threshold are calculated as given by equations (10) and (11) below.
$$UpperBorder_k = m\bar{x}_k + 2 \cdot m\sigma_k \qquad (10)$$
$$LowerBorder_k = m\bar{x}_k - 2 \cdot m\sigma_k \qquad (11)$$
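The sliding window calculation and the two borders of equations (10) and (11) may be sketched as follows. The function name `moving_borders`, the window length of 4 and the sample values are assumptions for illustration. The sketch also assumes that each border pair is computed from the window ending at the current point and that the population standard deviation is used; the description does not fix either choice.

```python
import math
from typing import List, Tuple


def moving_borders(dz: List[float], window: int, k: float = 2.0) -> List[Tuple[float, float]]:
    """Return one (lower, upper) border pair per full window position over dz."""
    if window < 1:
        raise ValueError("window length must be at least 1")
    borders = []
    for end in range(window, len(dz) + 1):
        chunk = dz[end - window:end]  # consecutive windows overlap by window - 1 points
        mean = sum(chunk) / window    # moving average
        sigma = math.sqrt(sum((v - mean) ** 2 for v in chunk) / window)  # moving deviation
        borders.append((mean - k * sigma, mean + k * sigma))
    return borders


dz_series = [0.0, 0.1, -0.1, 0.0, 0.2, -0.2]
borders = moving_borders(dz_series, window=4)  # three window positions
```

Because the borders are recomputed at every window position, they follow the behavior pattern of the network element instead of being fixed in advance.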
Using the definitions and notations described above, the self-adaptive statistical model is determined as follows:
$$OSR \xrightarrow{\text{yields}} Zscore \xrightarrow{\text{yields}} \Delta Zscore \xrightarrow{\text{yields}} [LowerBorder, UpperBorder] \qquad (12)$$
The relationship of equation (12) will be illustrated in further detail in connection with figs. 4 to 7. As mentioned above, fig. 4 shows the aggregated indicator OSR as graph 40, in which the anomalies 41, 42 can be detected. The status indicator OSR has a periodicity which depends on the behavior of the customers connected to the network element for which OSR was calculated.
Fig. 5 now shows a graph 50 representing the Z score as calculated by equation (5) above. It can be deduced from fig. 5 that most of the points of the graph have values between $-2\sigma_n$ and $+2\sigma_n$. However, isolated points 51 or 52 are present in the graph which could be anomalies. Fig. 6 shows the evolution over time of ΔZscore, determined by equation (6), in graph 60. As ΔZscore is a difference, most of the points have a value close to 0. However, the graph 60 also has outliers, such as 61, 62 and 63, 64. The points above the time axis of fig. 6, such as points 61 and 63, are associated for a specific moment in time with the phenomenon of desynchronization of a number of DSL ports compared to the previous time interval. Given the way OSR, Zscore and ΔZscore are determined, the values above 0 in fig. 6 indicate that a greater number of ports has a status parameter down compared to the previous time interval. When several ports suddenly change the status parameter to down, one might deduce that these points can trigger alarms. The points with negative values, such as points 62 and 64, are associated for a specific moment in time with the phenomenon of synchronization of a number of ports relative to the previous time interval. Thus, it can be deduced from points such as points 62 and 64 of fig. 6 that a lower number of ports has the operational status parameter down compared to the previous time interval. This can mean that a problem has been solved for a number of ports, so that these points could trigger the clearing of alarms.
Fig. 7 shows the thresholds, namely the upper border and the lower border as calculated by equations (10) and (11), together with the graph 60 indicating ΔZ. If an isolated point such as point 61 is greater than the upper threshold, the alarm is triggered. In a similar way, if an outlier is lower than the lower threshold, the alarm can be cleared. In the example given, an alarm is triggered at time 15:00:48 and canceled on the same day at 16:00:43, as the two points or outliers are located outside the thresholds. The points that are located between the upper and the lower threshold are considered part of the usual functional pattern of the DSLAM. As can be seen, the thresholds 71 and 72 change over time with the behavior patterns of the network element. The thresholds depend on time and are specific for each network element, as they are generated by the self-adaptive statistical model discussed above by taking into account previously collected OSR values in the window. The triggered alarms, such as alarms triggered by point 61 or 63, can be determined in the triggering or clearing alarm module 124 discussed above in connection with fig. 3. The generation of the self-adaptive statistical model may be carried out in module 123 discussed above in connection with fig. 3.
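The triggering and clearing of an alarm described in connection with fig. 7 can be sketched as a small state machine. The function name `alarm_events`, the event labels and the fixed border values in the example are hypothetical. The sketch further assumes that an alarm is only raised while none is active and only cleared while one is active, which the description does not state explicitly.

```python
from typing import List, Sequence, Tuple


def alarm_events(dz: Sequence[float],
                 borders: Sequence[Tuple[float, float]]) -> List[Tuple[int, str]]:
    """Return (index, "raise" | "clear") events; borders[i] is (lower, upper) for dz[i]."""
    events = []
    active = False
    for i, (value, (lower, upper)) in enumerate(zip(dz, borders)):
        if value > upper and not active:
            active = True   # more ports went down than the model expects
            events.append((i, "raise"))
        elif value < lower and active:
            active = False  # ports resynchronized beyond the usual pattern
            events.append((i, "clear"))
    return events


sample_dz = [0.0, 0.1, 3.0, 0.1, -2.5, 0.0]
fixed_borders = [(-1.0, 1.0)] * len(sample_dz)  # constant only to keep the example short
events = alarm_events(sample_dz, fixed_borders)
```

With these values the alarm is raised at index 2 and cleared at index 4, mirroring the trigger and cancel times of the example of fig. 7.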
Furthermore, the triggered alarms can be enriched with information from the external data sources 400 as mentioned above. Then they are filtered. One filtering option is filtering based on ports belonging to the same cable distribution group. Fig. 9 shows a schematic view of the network element comprising different ports 215. As shown in fig. 9, the different ports are grouped into different port groups 218a to 218f. The port group refers to the passive access network structure, wherein the copper cables used to connect the ports to the wide area network are divided into different groups and each group has several pairs of wires, such as a hundred pairs of wires. A pair of wires can be connected to a single port 215, and the port connected in this way is associated with the group 218 to which the corresponding wire pair is connected. As shown in fig. 9, a port group may comprise ports of a single network element, such as groups 218a to 218c, 218e and 218f or 218g, but a group can also be distributed across multiple network elements, as is the case for group 218d.
The OSR value is determined for a single network element 210. Fig. 9 furthermore shows the operational status parameter of a port. The fully shaded circle indicates a port having a status parameter of down, such as port 215a, whereas the non-shaded circles such as port 215b have an operational status parameter of up. The different port groups may have a different number of ports with a status parameter of down. By way of example, port group 218e comprises in the embodiment shown six ports which all have the status parameter down, whereas for port group 218b one port has the status parameter up and all other ports have the status parameter down. Based on the number of ports in each group having a status parameter down, different fault categories can be determined for the network element 210. By way of example, if all the groups connected to the network element 210 have at least one synchronized port (status up) or a port with a status other than down, then a first fault category may be assigned to the situation. In this situation severe faults, such as a missing cable, can be excluded, as in each port group at least one port has a status parameter other than down. Applied to the example of fig. 9 this would mean that all the port groups 218a to 218g have at least one port with status up. This is not the case, as in the given example all ports in group 218e have a status parameter down. Even though this category, in which all port groups have at least one port with status up, is a more or less fault-free situation, it is named the first fault category. A more severe fault category is the second fault category, in which at least one port group exists which does not have any port with status up. This is the case for the upper network element 210 shown in fig. 9, as port group 218e has all ports with status parameter down.
A third more severe category would be that no group exists with a port having a status parameter up.
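The three fault categories can be expressed compactly in code. The following Python sketch is illustrative only: the function name `fault_category`, the dictionary layout and the port statuses of the example are assumptions, and only the ports of a single network element are considered, even for a group that spans several elements.

```python
from typing import Dict, List


def fault_category(port_groups: Dict[str, List[str]]) -> int:
    """Classify one network element from the port statuses of its port groups.

    1: every group has at least one port with a status other than down
    2: at least one group has all of its ports down
    3: every group has all of its ports down
    """
    if not port_groups or any(not statuses for statuses in port_groups.values()):
        raise ValueError("every port group needs at least one port")
    all_down = [all(s == "down" for s in statuses) for statuses in port_groups.values()]
    if all(all_down):
        return 3
    if any(all_down):
        return 2
    return 1


# Loosely modeled on fig. 9: group 218e has six ports, all down;
# group 218b has one port up. The port counts of 218a and 218b are invented.
example_groups = {
    "218a": ["up", "up", "down"],
    "218b": ["up", "down", "down", "down"],
    "218e": ["down"] * 6,
}
category = fault_category(example_groups)  # second fault category
```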
In the situation shown in fig. 9 for the upper network element, one cannot directly deduce the reason for the fault. One port group 218e has all ports with status parameter down. However, it cannot be said with certainty that a cable fault or missing cable is the reason for the detected behaviour. To this end, SELT measurements are launched for at least two ports of this group 218e. The SELT measurements for a single port can take up to 4 minutes and are carried out by a SELT chipset of a DSLAM card. If the measurements provide the same length, it can be deduced that the cable of the group 218e is cut, stolen or damaged.
A cable fault or missing cable can occur for groups with all the ports having the status parameter down.
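The comparison of the SELT results can be sketched as follows. The sketch only consumes length values that are assumed to have been measured already; it does not model the SELT measurement itself. The function name `lengths_agree` and the tolerance of 10 metres are hypothetical, since the description does not state how close two lengths must be to count as the same.

```python
from typing import Sequence


def lengths_agree(lengths_m: Sequence[float], tolerance_m: float = 10.0) -> bool:
    """True when all measured loop lengths lie within tolerance_m of each other.

    Agreement across at least two ports of a group in which all ports are down
    points to a cable that is cut, stolen or damaged at that distance.
    """
    if len(lengths_m) < 2:
        raise ValueError("at least two SELT measurements are needed")
    return max(lengths_m) - min(lengths_m) <= tolerance_m


suspect = lengths_agree([412.0, 409.5, 415.0])  # lengths agree within 10 m
not_suspect = lengths_agree([412.0, 1250.0])    # lengths differ widely
```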
Fig. 1 summarizes the steps carried out by fault detector 100 to detect a fault at a network element such as network element 210. The method starts in step S10, and in step S11 the status parameters are determined for the plurality of ports, such as all ports of one network element. Furthermore, the evolution in time of the status parameters is monitored. Based on the monitored status parameters the status indicator OSR, the out of sync ratio, is determined in step S12. As indicated by equations (1) and (2) above, the OSR values are determined for different time intervals, resulting in the graph of fig. 4. Based on the status indicator a self-adaptive statistical model with a time dependent upper threshold and a time dependent lower threshold is determined in step S13, wherein the model includes the ΔZscore as shown in figs. 6 and 7 with the upper and lower thresholds shown in fig. 7. When the self-adaptive statistical model is known together with the thresholds, it is possible to determine the outliers in the signal evolution in step S14, e.g. the outliers 61 to 64 shown in figs. 6 and 7. In step S15 a fault can then be detected based on the outlier. As described above, the outlier may trigger an alarm, and the alarm is enriched with information to see whether known reasons exist for the alarm. Furthermore, a filtering can be carried out, such as the filtering to determine the behavior of the port groups to which the different ports belong. Based on the filtering, different fault categories may be determined as discussed above. The method ends in step S16.
Fig. 8 shows a schematic view of a fault detector 100, the fault detector comprising an input/output unit 110 comprising a transmitter 111 and a receiver 112. The input/output unit 110 represents the possibility of the fault detector to transmit control messages or user data to other entities, the receiver 112 representing the possibility to receive control messages or user data from other entities. The input/output unit 110 may be used inter alia to receive the different status parameters of the ports which are then used for the determination of the self-adaptive statistical model. Furthermore, a processing unit 160 is provided comprising one or more processors and which is responsible for the operation of the fault detector as discussed above. The processing unit 160 can generate the commands that are needed to carry out the procedures of the fault detector discussed above or further below in which the fault detector is involved. A memory 130 can be provided which can store a suitable program code to be executed by the processing unit 160. A display 140 can display information such as the different fault categories or the status of the different ports, wherein a human-machine interface HMI 150 may be provided for the interaction between a user and the fault detector. The different modules 121 to 128 shown in fig. 3 may be partly provided in processing unit 160 as processing modules, e.g. the task scheduler 121, or may be provided as software modules in memory 130, such as the module 123 for the determination of the statistical model. Furthermore, the different modules shown in fig. 3 may be implemented by a combination of hardware and software partly stored or present in processing unit 160 and partly present in memory 130 as a module which, when processed by the processing unit 160, provides instructions needed to determine the fault as discussed above.
Memory 130 can comprise different program modules which, when executed by the processing unit 160, cause the processing unit 160 to execute the corresponding method steps discussed above or in further detail below.
From fig. 8 it may be deduced that the fault detector includes one or more processing units and a memory 130 coupled to the processing unit. The memory can include a read-only memory, a random access memory, a dynamic RAM or a static RAM, a mass storage or the like. The memory can include various program code modules for causing the fault detector to perform operations as discussed above in connection with figs. 1 to 3 and as discussed in connection with figs. 4 to 7 for the determination of the self-adaptive statistical model. The fault detector can comprise a processor and a memory, wherein the memory contains instructions executable by the processor, wherein the apparatus is operative to carry out the steps mentioned in connection with figs. 1 to 3 or mentioned in the context of figs. 4 to 7 for the determination of the self-adaptive statistical model. Furthermore, a fault detector may be provided comprising different modules configured to perform the steps discussed above in connection with figs. 1 to 3 or discussed above in connection with figs. 4 to 7 for the generation of the self-adaptive statistical model. Furthermore, some general conclusions can be drawn from the above discussion.
The self-adaptive statistical model can comprise a time dependent model parameter ΔZ with a time dependent upper threshold and a time dependent lower threshold, wherein the fault is detected when the model parameter is outside the upper or lower threshold. With the time dependent upper and lower thresholds, the fault detector can react better than with fixed thresholds to the behavior of the customers, which influences the status parameters of the ports as shown in fig. 3. As shown especially in connection with fig. 7, the adaptive thresholds better reflect whether an anomaly is present or not.
By way of example, the alarm signal may be triggered or activated when the model parameter ΔZ is outside one of the upper or lower threshold, wherein the alarm signal may be deactivated again after being activated when the model parameter is outside the other of the upper or lower threshold. In the example given above, the alarm was triggered when the model parameter was outside the upper threshold and was deactivated again when the model parameter was outside the lower threshold. However, depending on the definition of the parameters involved, the situation may be vice versa. Furthermore, the time dependent model parameter ΔZ with the upper and lower threshold may be determined for different consecutive time intervals, and the determination of the self-adaptive statistical model comprises determining Z scores of the status indicator and comparing the Z scores of consecutive time intervals in order to determine the time dependent model parameter ΔZ.
The determination of the statistical model is based on the determination of a ratio between the number of ports of the network element having an operational status parameter down and the number of ports of the network element configured for delivery of the data packets, in other words the ports taken in administration.
Furthermore, the different ports can be grouped into different port groups, such as port groups 218a to 218g shown in fig. 9. Furthermore, a fault category can be determined for a network element based on the operation status parameters of the ports in the different port groups of the network element.
Different fault categories can be determined describing how severe a fault is, based on how many ports in each of the port groups have an operational status parameter other than down. The determination of the fault category can inter alia comprise the following steps. It may be determined whether at least one port group exists in which at least one port has a status parameter other than down. Furthermore, it may be determined whether in each of the port groups at least one port exists which has a status parameter other than down. If a port group exists in which one port has the status parameter up, it can be excluded that this port group is a port group with a stolen or missing cable. A first fault category may be determined when each of the port groups of the network element has at least one port with an operational status parameter other than down. A more severe fault category, however, is determined when at least one port group of the network element exists in which all ports of the at least one port group have the operational status parameter down.
A still more severe fault category is determined when in all of the port groups of the network element all ports in the corresponding port groups have the operational status parameter down.
As discussed in connection with fig. 9, a cable is connected to a defined number of ports, and a cable fault or missing cable can be excluded when at least one port of the defined number of ports has a status parameter other than down. However, when a cable fault cannot be excluded, it is possible to determine a location of a fault such as a cable fault or missing cable based on test measurements carried out through at least two of the defined number of ports which have an operational status parameter down.
When a fault is determined for a network element, information from the access network may be used to filter out known faults for which the cause is already known, and an alarm is generated for the network element only after the known faults have been filtered out. This filtering avoids generating an alarm for a port status behavior for which an explanation is already known.
The self-adaptive statistical model with the time dependent upper threshold and the time dependent lower threshold may be determined for each time interval based on values of the status indicator which are in the corresponding time interval. The upper and the lower thresholds are determined for each time interval based on historical status parameter values in accordance with the self-adaptive statistical model.
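The filtering of known faults can be sketched as a simple lookup. Everything in the sketch is hypothetical: the `Alarm` record, the element identifiers and the representation of the external information as a mapping from network element to a known cause are assumptions chosen only to illustrate the filtering step.

```python
from typing import Dict, List, NamedTuple


class Alarm(NamedTuple):
    element_id: str  # hypothetical identifier of the network element
    time_index: int  # time interval in which the outlier was found


def filter_known(alarms: List[Alarm], known_causes: Dict[str, str]) -> List[Alarm]:
    """Keep only the alarms of elements for which no cause is already known."""
    return [alarm for alarm in alarms if alarm.element_id not in known_causes]


raised = [Alarm("dslam-1", 5), Alarm("dslam-2", 7)]
known = {"dslam-1": "planned maintenance"}  # e.g. taken from an external data source
remaining = filter_known(raised, known)     # only the alarm of dslam-2 remains
```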
Summarizing, the solution provides a possibility to determine a fault of a network element based on known characteristics of a network element such as the port status.

Claims
1. A method for detecting a fault at a network element (210) located in a data packet access network, comprising:
- determining status parameters of a plurality of ports (215) of the network element (210) and determining an evolution in time of the status parameters,
- determining a status indicator (OSR) and a temporal evolution of the status indicator based on the determined status parameters, the status indicator representing information of aggregated status parameters of several ports (215) of the network element,
- determining a self-adaptive statistical model based on the determined status indicator (OSR),
- determining an outlier in the self-adaptive statistical model,
- detecting the fault at the network element (210) based on the determined outlier.
2. The method according to claim 1, wherein the self-adaptive statistical model comprises a time dependent model parameter ΔZ with a time dependent upper threshold (71) and a time dependent lower threshold (72), wherein the fault is detected when the model parameter is outside the upper or lower threshold (71, 72).
3. The method according to claim 2, wherein when the fault is detected an alarm signal is activated, wherein the alarm signal is activated when the model parameter ΔZ is outside one of the upper and lower threshold (71, 72), wherein the alarm signal is deactivated after being activated when the model parameter is outside the other of the upper and lower threshold (71, 72).
4. The method according to claim 2 or 3, wherein the time dependent model parameter ΔZ with the upper and lower threshold (71, 72) is determined for different consecutive time intervals (80), wherein determining the self-adaptive statistical model comprises determining z scores of the status indicator (OSR) and comparing the z scores of consecutive time intervals (80) in order to determine the time dependent model parameter ΔZ.
5. The method according to any of the preceding claims, wherein determining the status indicator (OSR) comprises determining a ratio between the number of ports (215a) of the network element having an operational status parameter down and the number of ports of the network element having an operational status parameter configured for delivery of data packets.
6. The method according to claim 5, wherein the plurality of ports (215) are grouped into different port groups (218a-g), wherein a fault category is determined for the network element based on the operational status parameters of the ports in the different port groups of the network element.
7. The method according to claim 6, wherein different fault categories are determined describing how severe the fault is based on the fact how many ports exist in each of the port groups of the network element which have the operational status parameter other than down.
8. The method according to claim 6 or 7, wherein determining the fault category comprises at least one of:
- determining whether at least one port group exists in which at least one port (215) has a status parameter other than down,
- determining whether in each of the port groups at least one port (215) exists which has a status parameter other than down.
9. The method according to any of claims 6 to 8, wherein a first fault category is determined when each of the port groups of the network element have at least one port with the operational status parameter other than down.
10. The method according to any of claims 6 to 9, wherein a more severe fault category is determined when at least one port group of the network element exists in which all ports of the at least one port group have the operational status parameter down.
11. The method according to any of claims 6 to 10, wherein a still more severe fault category is determined when in all of the port groups of the network element all ports in the corresponding port groups have the operational status parameter down.
12. The method according to any of the preceding claims, wherein a cable is connected to a defined number of ports, wherein a cable fault or a missing cable is excluded when at least one port exists from the defined number of ports which has a status parameter other than down.
13. The method according to claim 12, further comprising the step of determining a location of a cable fault or missing cable based on test measurements carried out through at least two of the defined number of ports which have a status parameter down.
14. The method according to any of the preceding claims, wherein when the fault is determined for the network element (210), information from the access network (200) is used to filter out known faults for which the cause is already known, wherein an alarm is generated for the network element after the known faults were filtered out.
15. The method according to any of claims 4 to 14, wherein the self-adaptive statistical model with a time dependent upper threshold (71) and a time dependent lower threshold (72) is determined for each time interval (80) based on values of the status indicator (OSR) in the corresponding time interval (80).
16. The method according to any of claims 4 to 12, wherein the upper and lower threshold (71, 72) is determined for each time interval (80) based on historical status parameters in accordance with the self-adaptive statistical model.
17. A fault detector (100) configured to detect a fault at a network element (210) located in a data packet access network (200), wherein the network element (210) comprises a plurality of ports (215), comprising:
- at least one processing unit (160) configured to
- determine status parameters of a plurality of ports (215) of the network element and determine an evolution in time of the status parameters,
- determine a status indicator (OSR) and a temporal evolution of the status indicator based on the determined status parameters, the status indicator representing information of aggregated status parameters of several ports (215) of the network element,
- determine a self-adaptive statistical model based on the determined status indicator,
- determine an outlier in the self-adaptive statistical model, and
- detect the fault at the network element (210) based on the determined outlier.
18. The fault detector (100) according to claim 17, wherein the self-adaptive statistical model comprises a time dependent model parameter ΔZ with a time dependent upper threshold (71) and a time dependent lower threshold (72), the at least one processing unit (160) being configured to detect the fault when the model parameter is outside the upper or lower threshold (71, 72).
19. The fault detector (100) according to claim 18, wherein when the fault is detected the at least one processing unit (160) is configured to activate an alarm when the model parameter ΔZ is outside one of the upper and lower threshold (71, 72), wherein the at least one processing unit (160) is configured to deactivate the alarm signal after being activated when the model parameter is outside the other of the upper and lower threshold (71, 72).
20. The fault detector (100) according to claim 18 or 19, wherein the at least one processing unit (160) is configured to determine the time dependent model parameter ΔZ with the upper and lower threshold (71, 72) for different consecutive time intervals (80), wherein the at least one processing unit (160), for determining the self-adaptive statistical model, is configured to determine z scores of the status indicator and to compare the z scores of consecutive time intervals (80) in order to determine the time dependent model parameter ΔZ.
21. The fault detector (100) according to any of claims 15 to 17, wherein the at least one processing unit (160) is configured, for determining the status indicator, to determine a ratio between the number of ports (215a) of the network element having an operational status parameter down and the number of ports of the network element configured for delivery of data packets.
22. The fault detector (100) according to claim 18, wherein the at least one processing unit (160) is configured to group the plurality of ports (215) into different port groups (218 a-g) and to determine a fault category for the network element based on the operational status parameters of the ports in the different port groups of the network element.
23. The fault detector (100) according to claim 19, wherein the at least one processing unit (160), is configured to determine different fault categories describing how severe the fault is based on the fact how many ports exist in each of the port groups of the network element which have the operational status parameter other than down.
24. The fault detector (100) according to claim 20, wherein a cable is connected to a defined number of ports, wherein the at least one processing unit (160) is configured to exclude a cable fault or a missing cable when at least one port exists from the defined number of ports which has a status parameter other than down.
25. The fault detector (100) according to claim 21, wherein the at least one processing unit (160) is configured to determine a location of a cable fault or missing cable based on test measurements carried out through at least two of the defined number of ports which have an operational status parameter down.
26. The fault detector (100) according to any of claims 17 to 25, wherein the at least one processing unit (160) is configured to use information from the access network (200) to filter out known faults for which the cause is already known, wherein the at least one processing unit (160) is configured to generate an alarm for the network element after the known faults were filtered out.
27. The fault detector according to any of claims 17 to 26, wherein the at least one processing unit is configured to determine the self-adaptive statistical model with a time dependent upper threshold (71) and a time dependent lower threshold (72) for each time interval (80) based on values of the status indicator (OSR) in the corresponding time interval (80).
28. The fault detector according to any of claims 17 to 27, wherein the at least one processing unit is configured to determine the upper and lower threshold (71, 72) for each time interval (80) based on historical status parameters in accordance with the self-adaptive statistical model.
29. A computer program comprising program code to be executed by at least one processing unit of a fault detector, wherein execution of the program code causes the at least one processing unit to execute a method according to any of claims 1 to 16.
30. A carrier comprising the computer program of claim 29, wherein the carrier is one of an electronic signal, optical signal, radio signal, or computer readable storage medium.
PCT/EP2015/073186 2015-10-07 2015-10-07 Anomaly detection in a data packet access network WO2017059904A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/EP2015/073186 WO2017059904A1 (en) 2015-10-07 2015-10-07 Anomaly detection in a data packet access network

Publications (1)

Publication Number Publication Date
WO2017059904A1 true WO2017059904A1 (en) 2017-04-13

Family

ID=54291286

Country Status (1)

Country Link
WO (1) WO2017059904A1 (en)

Similar Documents

Publication Publication Date Title
KR102418969B1 (en) System and method for predicting communication apparatuses failure based on deep learning
US10855514B2 (en) Fixed line resource management
CN102308522B (en) Method, device and system for locating network fault
US6353902B1 (en) Network fault prediction and proactive maintenance system
US6239699B1 (en) Intelligent alarm filtering in a telecommunications network
US7007084B1 (en) Proactive predictive preventative network management technique
CN106789177A (en) A kind of system of dealing with network breakdown
EP2837168B1 (en) Diagnostic methods for twisted pair telephone lines based on line data distribution analysis
US7855952B2 (en) Silent failure identification and trouble diagnosis
US10708155B2 (en) Systems and methods for managing network operations
CA2768220A1 (en) Method and apparatus for telecommunications network performance anomaly events detection and notification
CN101189895A (en) Abnormality detecting method and system, and upkeep method and system
WO2017059904A1 (en) Anomaly detection in a data packet access network
CN109120338B (en) Network fault positioning method, device, equipment and medium
CN110650060A (en) Processing method, equipment and storage medium for flow alarm
US8149719B2 (en) System and method for marking live test packets
JP2015526920A (en) Apparatus, system and method for detecting and mitigating impulse noise
CN113572654A (en) Network performance monitoring method, network device and storage medium
WO2015180542A1 (en) Method and apparatus for detecting continuous-mode optical network unit, and network management device
EP2561646B1 (en) Apparatuses and methods for registering transmission capacities in a broadband access network
CN112532467B (en) Method, device and system for realizing fault detection
CN112491635A (en) Method, system, implementation equipment and storage medium for link quality detection
US8566634B2 (en) Method and system for masking defects within a network
CN105917614B (en) Method and apparatus for operating access network
CN113676403A (en) Relay line fault transfer method based on dynamic detection

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 15778273; Country of ref document: EP; Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 15778273; Country of ref document: EP; Kind code of ref document: A1