WO2023156827A1

WO2023156827A1 - Anomaly detection

Info

Publication number: WO2023156827A1
Application number: PCT/IB2022/051475
Authority: WO
Inventors: Zhaoji HUANG; Sarang ARAVAMUTHAN; Angel Barranco; Kunal Rajan DESHMUKH
Original assignee: Telefonaktiebolaget Lm Ericsson (Publ)
Priority date: 2022-02-18
Filing date: 2022-02-18
Publication date: 2023-08-24

Abstract

A method (300) for anomaly detection. The method includes storing (s302) time series data, the stored time series data comprising a first set of N data points, wherein N > 2 and each data point in the first set of data points was obtained at a different point in time. The method further includes using (s304) the most current data point from the first set of N data points to determine whether or not to perform an anomaly detection process using at least N-1 of the N data points.

Description

ANOMALY DETECTION

TECHNICAL FIELD

[001] Disclosed are embodiments related to anomaly detection.

BACKGROUND

[002] Anomaly detection (AD) is a well-established use case in telecommunication systems to identify abnormal behavior in a network. The method is normally applied to time series data measuring different metrics characterizing the health of a part of the network (e.g., a cell). The method works by monitoring normal behavior to understand the pattern of data under standard conditions and set thresholds. When successive values over a prescribed period are beyond the threshold, an anomaly is flagged.

[003] In general, there are two types of anomaly detection techniques: simple statistical methods and machine learning-based approaches. Simple statistical methods are characterized by a light “footprint” (i.e., few computing resources) but limited by their inability to handle more challenging scenarios. Machine learning (ML)-based approaches can learn more sophisticated patterns but have larger footprints (i.e., require more computing resources). A survey of different anomaly detection techniques can be found at blogs(dot)oracle(dot)com/ai-and-datascience/post/introduction-to-anomaly-detection.

[004] When there is a need to perform anomaly detection in a large network with limited resources, the statistical approach is generally preferred over an ML-based approach.

SUMMARY

[005] Certain challenges presently exist. For instance, conventional AD systems perform an AD process each time that a new data point in the time series is obtained. This, however, is often inefficient because anomalies are detected only when several consecutive data points show an abnormal pattern. In particular, this inefficiency makes it difficult to scale conventional AD systems to analyze hundreds or thousands of cells, which is a common occurrence in telecommunication systems.

[006] For example, assume that one were to perform the AD process for each cell in a network with N cells, where, for each cell, a time series data point is obtained every M minutes. In such a scenario, the total number of daily executions of the AD process is:

N × (1440 / M). Because anomalies, by definition, are rare, the efficiency of the AD system is low (i.e., the number of detected anomalies divided by the total number of daily executions is a low number).
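As a worked illustration of the expression above, the following minimal sketch computes the daily execution count. The function name and the example values (10,000 cells, 15-minute interval) are assumptions chosen for illustration only.

```python
def daily_ad_executions(num_cells: int, sampling_interval_min: float) -> float:
    """Daily AD executions when the AD process runs on every new data point.

    Implements N x (1440 / M), where 1440 is the number of minutes in a day.
    """
    minutes_per_day = 24 * 60  # 1440
    return num_cells * (minutes_per_day / sampling_interval_min)


# Hypothetical example: 10,000 cells, one data point per cell every 15 minutes.
print(daily_ad_executions(10_000, 15))  # 960000.0
```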

[007] Accordingly, in one aspect there is provided an improved method for anomaly detection. The method may be performed by a network management node. The method includes storing time series data, the stored time series data comprising a first set of N data points, wherein N > 2 and each data point in the first set of data points was obtained at a different point in time. The method further includes using the most current data point from the first set of N data points to determine whether or not to perform an anomaly detection process using at least N-1 of the N data points.

[008] In another aspect there is provided a computer program comprising instructions which when executed by processing circuitry of a network management node, causes the network management node to perform the above described method. In another aspect there is provided a carrier containing the computer program, wherein the carrier is one of an electronic signal, an optical signal, a radio signal, or a computer readable storage medium.

[009] In another aspect there is provided a network management node where the network management node is configured to perform the methods disclosed herein. In some embodiments, the network management node includes processing circuitry and a memory containing instructions executable by the processing circuitry, whereby the network management node is configured to perform the methods disclosed herein.

[0010] The embodiments are advantageous in that they have a small energy footprint (i.e., reduced computation), which is beneficial for many reasons, including energy savings that lead to a low carbon footprint and enable scalability. With respect to scalability, with reduced computational effort per cell, the embodiments can be scaled to handle more cells within a sampling interval. This is a critical requirement for large-scale Communication Service Providers who deploy tens of thousands of cells in a geographical region.

BRIEF DESCRIPTION OF THE DRAWINGS

[0011] The accompanying drawings, which are incorporated herein and form part of the specification, illustrate various embodiments.

[0012] FIG. 1 illustrates a communication system according to an embodiment.

[0013] FIG. 2 is a flowchart illustrating a process according to some embodiments.

[0014] FIG. 3 is a flowchart illustrating a process according to some embodiments.

[0015] FIG. 4 illustrates a network management node according to some embodiments.

DETAILED DESCRIPTION

[0016] FIG. 1 illustrates a communication system 100 according to an embodiment. Communication system 100 includes network nodes (e.g., base stations 111 and 112) that each serve one or more cells (e.g., cell 121 and 122). While only two network nodes are shown, it is possible that a communication system includes tens of thousands of network nodes or more, where each network node serves one or more cells.

[0017] Communication system 100 further includes a network management node 104, which, in the illustrated embodiment, includes (1) a data gathering function (DGF) 132 that functions to obtain and store time series data for each cell in system 100 and (2) an AD function (ADF) 134 that, for each cell for which time series data is collected, uses the stored time series data for the cell to detect whether or not the cell is experiencing an anomaly.

[0018] For instance, in one embodiment, data gathering function 132 creates and updates a database (DB) 190 that stores time series data having the following form:

TABLE 1

[0019] As shown in Table 1, the database 190 stores, for each of cell1, cell2, and cell3, first time series data (i.e., a first set of data points) corresponding to a first performance metric (PM-1) (e.g., average latency, average throughput, cell downtime, etc.) for the cell and second time series data (i.e., a second set of data points) corresponding to a second performance metric (PM-2) (e.g., average latency, average throughput, cell downtime, etc.) for the cell.

[0020] The database is not limited to this form shown as the database can store time series data for any number of cells and/or any number of performance metrics. For instance, for each cell, the database may only store time series data for a single performance metric.

[0021] Moreover, for each data point stored in the database (e.g., data point v111), the database may also store a timestamp that indicates the time at which the data point was generated or received by data gathering function 132. In one embodiment, to save space, the database 190 only stores the most recent N data points for a given cell and given performance metric (this feature is illustrated in the table above which shows that for each cell/performance metric pair, the database stores at most N data points). Thus, in such an embodiment, when data gathering function 132 receives a new data point for a particular cell/performance metric pair and the database already has N data points for this particular cell/performance metric pair, the data gathering function 132 will remove from the database the oldest data point for this particular cell/performance metric pair and then add to the database the new data point for this particular cell/performance metric pair.
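The "keep only the most recent N data points" behavior described above can be sketched with a bounded buffer per cell/performance metric pair. This is a minimal in-memory illustration, not the database 190 itself; the names `db` and `store_data_point`, the value N=8, and the integer timestamps are assumptions.

```python
from collections import defaultdict, deque

N = 8  # assumed window size (the description uses N=8 in one example)

# One bounded buffer per (cell, metric) pair. A deque with maxlen=N discards
# its oldest entry automatically when a new entry is appended to a full buffer.
db = defaultdict(lambda: deque(maxlen=N))


def store_data_point(cell, metric, timestamp, value):
    """Store (timestamp, value), evicting the oldest point once N are held."""
    db[(cell, metric)].append((timestamp, value))


# Hypothetical usage: ten points arrive for cell1 / PM-1, only eight are kept.
for t in range(10):
    store_data_point("cell1", "PM-1", t, float(t))

print(len(db[("cell1", "PM-1")]))  # 8
print(db[("cell1", "PM-1")][0])    # (2, 2.0)
```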

[0022] AD function 134 provides an efficient anomaly detection method by utilizing an “AD trigger function” to reduce the computational load of AD function 134 by reducing how often AD function 134 uses time series data to make a determination as to whether or not an anomaly is present. For example, in one embodiment, the AD trigger function is employed at least once every X units of time (e.g., at least once every 2 hours), and, based on the output of the AD trigger function, a decision is made as to whether to use the current time series data in database 190 to detect an anomaly or to wait until a later time and use at that later time the current time series data in database 190 to detect an anomaly. Thus, for instance, if the AD trigger function returns FALSE, then anomaly detection is not triggered until some later point in time.

[0023] FIG. 2 is a flowchart illustrating a process 200 according to an embodiment that is performed by AD function 134 for a given cell/performance metric pair. That is, AD function 134 performs process 200 for each cell/performance metric pair. Process 200 may begin in step s202. Step s202 comprises AD function 134 determining whether database 190 contains at least N data points for the given cell/performance metric pair under consideration. If not, AD function 134 goes back to performing step s202, otherwise it proceeds to step s204.

[0024] Step s204 comprises AD function 134 determining whether the most recently obtained data point for the given cell/performance metric pair under consideration (denoted VN) satisfies an AD triggering condition (e.g., in one embodiment in which the performance metric is average latency, AD function 134 determines whether VN is greater than a threshold; in another embodiment in which the performance metric is average throughput, AD function 134 determines whether VN is less than a threshold). If VN satisfies the AD triggering condition (e.g., VN is less than a threshold), then AD function 134 proceeds to step s206 in which AD function performs an AD process, otherwise it proceeds to step s210.
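The triggering check of step s204 can be sketched as a single comparison whose direction depends on the performance metric. The metric names, the `TRIGGER_RULES` mapping, and the threshold values below are assumptions for illustration.

```python
import operator

# Assumed mapping from performance metric to comparison direction:
# latency triggers when ABOVE the threshold, throughput when BELOW it.
TRIGGER_RULES = {
    "average_latency": operator.gt,
    "average_throughput": operator.lt,
}


def ad_trigger(metric, latest_value, threshold):
    """Return True if the most recent data point (VN) satisfies the
    AD triggering condition for the given metric."""
    return TRIGGER_RULES[metric](latest_value, threshold)


# Hypothetical values: latency in ms, throughput in Mbps.
print(ad_trigger("average_latency", 120.0, 100.0))   # True
print(ad_trigger("average_throughput", 7.5, 5.0))    # False
```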

[0025] Step s206 comprises AD function 134 performing the AD process. That is, in step s206, AD function 134 uses at least the most recent N-1 data points for the given cell/performance metric pair under consideration to determine whether an anomaly is present. In one embodiment, AD function 134 determines that an anomaly is present if all of the most recent N-1 data points satisfy the AD triggering condition (e.g., if the performance metric is average throughput such that each data point is an average throughput value, then AD function determines that an anomaly is present if each one of the most recent N-1 data points is less than the threshold). In another embodiment where the performance metric is average throughput, AD function 134 determines that an anomaly is present if all of the most recent N-1 data points satisfy the AD triggering condition and VN is less than VN-1. Here, the second condition that VN is less than VN-1 is a check to see if the anomalous trend is continuing (i.e., that the average throughput below the threshold is continuing to decrease with the latest data point VN).
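The two throughput embodiments of step s206 can be sketched as one function with an optional trend check. The function name, the `require_trend` flag, and the sample values are assumptions; the list is assumed to hold the N stored data points, oldest first.

```python
def detect_throughput_anomaly(points, threshold, require_trend=False):
    """Sketch of the AD process for average throughput.

    points: the N stored data points, oldest first (so points[-1] is VN).
    An anomaly is reported if each of the most recent N-1 points is below
    the threshold and, when require_trend is set, VN is also below VN-1.
    """
    if len(points) < 3:
        raise ValueError("the description requires N > 2 data points")
    recent = points[1:]  # the most recent N-1 of the N points
    below = all(v < threshold for v in recent)
    if not require_trend:
        return below
    return below and points[-1] < points[-2]


# Hypothetical throughput values with a threshold of 5.
print(detect_throughput_anomaly([50, 4, 3, 2], 5))                      # True
print(detect_throughput_anomaly([50, 4, 2, 3], 5, require_trend=True))  # False
```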

[0026] In one embodiment, when an anomaly is detected, one or more actions are taken. These actions may include one or more of: (1) network management node 104 generating an alarm notification to notify the network operator that an anomaly has been detected; (2) network management node 104 adjusting one or more configuration parameters for the network node experiencing the anomaly; (3) network management node 104 attempting to reduce the load on the network node experiencing the anomaly by adjusting a load balancer to steer traffic away from said node; etc.

[0027] Step s208 comprises AD function 134 waiting until the next new data point is obtained. For instance, in one embodiment, a new data point is obtained every fifteen minutes. Thus, in this embodiment, in step s208 AD function 134 ends up waiting about fifteen minutes or less. After step s208 (e.g., after the next new data point is obtained), AD function 134 goes back to step s204.

[0028] Step s210 comprises AD function 134 waiting until N new data points for the given cell/performance metric are obtained. In the embodiment in which data gathering function 132 obtains a new data point every fifteen minutes and N=8, AD function 134 ends up waiting about 2 hours in step s210. After step s210, AD function 134 goes back to step s204.
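The waiting period in step s210 bounds how often the trigger is evaluated while the metric stays normal. The small calculation below assumes the fifteen-minute interval and N=8 mentioned above and a metric that never satisfies the triggering condition; the function names are illustrative.

```python
MINUTES_PER_DAY = 24 * 60  # 1440


def trigger_checks_per_day_when_quiet(sampling_interval_min, n):
    """Trigger evaluations per day for one cell/metric pair when the trigger
    never fires, i.e. one check per N new data points (step s210)."""
    return MINUTES_PER_DAY / (sampling_interval_min * n)


def conventional_ad_runs_per_day(sampling_interval_min):
    """AD executions per day for one cell/metric pair when AD runs on
    every new data point."""
    return MINUTES_PER_DAY / sampling_interval_min


print(trigger_checks_per_day_when_quiet(15, 8))  # 12.0
print(conventional_ad_runs_per_day(15))          # 96.0
```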

[0029] The table below contains pseudo code of a computer program that can be used to implement process 200 where the performance metric is average latency and a second condition that VN is greater than VN-1 is a check to see if the anomalous trend is continuing (i.e., that the average latency above the threshold Th is continuing to increase with the latest data point VN).
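A minimal Python sketch of process 200 for this average-latency case follows. It is written from the description of steps s202 to s210 rather than taken from the original pseudo code; the function name `run_process_200`, the replay-over-a-list design, and the sample values are assumptions.

```python
from collections import deque


def run_process_200(values, n, threshold):
    """Replay a latency series through steps s202-s210.

    Returns (ad_runs, anomalies): the indices at which the AD process was
    executed and the indices at which an anomaly was reported.
    """
    buf = deque(maxlen=n)  # most recent N data points
    skip = 0               # new data points still to wait for (s210)
    ad_runs, anomalies = [], []
    for i, v in enumerate(values):
        buf.append(v)
        if len(buf) < n:   # s202: fewer than N points stored so far
            continue
        if skip > 0:       # s210: still waiting for N new points
            skip -= 1
            continue
        if v > threshold:  # s204: VN satisfies the triggering condition
            ad_runs.append(i)
            recent = list(buf)[1:]  # s206: most recent N-1 points
            if all(x > threshold for x in recent) and buf[-1] > buf[-2]:
                anomalies.append(i)
            # s208: re-check at the very next data point (skip stays 0)
        else:
            skip = n - 1   # s210: next check after N new data points
    return ad_runs, anomalies


# Hypothetical latency series (ms), N=4, threshold Th=100.
series = [10] * 7 + [150, 160, 170]
print(run_process_200(series, n=4, threshold=100))  # ([7, 8, 9], [9])
```

In this replay the AD process runs three times instead of once per data point, and the anomaly is reported as soon as the most recent N-1 points all exceed Th with a rising latest value.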

[0030] FIG. 3 is a flowchart illustrating a process 300 according to an embodiment that is performed by network management node 104. Process 300 may begin in step s302. Step s302 comprises storing time series data, the stored time series data comprising a first set of N data points, wherein N > 2 and each data point in the first set of data points was obtained at a different point in time. Step s304 comprises using the most current data point from the first set of N data points to determine whether or not to perform an anomaly detection process using at least N-1 of the N data points.

[0031] In some embodiments, using the most current data point to determine whether or not to perform the anomaly detection process comprises comparing the data point to a threshold. In some embodiments, using the most current data point to determine whether or not to perform the anomaly detection process further comprises determining, based on the comparing, that a condition is satisfied (e.g., the data point is less than the threshold, the data point is greater than the threshold, the data point is not greater than the threshold, etc.).

[0032] In some embodiments the method further includes performing the anomaly detection process using at least N-1 of the N data points as a result of determining that the condition is satisfied. In some embodiments the method further includes, after performing the anomaly detection process, obtaining a new current data point and using the new current data point to determine whether or not to perform the anomaly detection process using at least N-1 of the most recent data points included in the first set of N data points. In some embodiments the method further includes performing the anomaly detection process using the N-1 of the most recent data points included in the first set of N data points.

[0033] In some embodiments the method further includes, as a result of determining that the condition is not satisfied: refraining from performing the anomaly detection process; collecting a new set of N data points; and using the most current data point from the new set of N data points to determine whether or not to perform the anomaly detection process using at least N-1 of the new set of N data points.

[0034] In some embodiments, the first set of N data points is associated with a cell of a mobile communication network, and each one of N data points included in the first set of N data points is a measure of a first performance metric for the cell. In some embodiments, the first performance metric is: an average throughput of the cell, an average latency associated with the cell, or a downtime for the cell.

[0035] In some embodiments, the time series data further comprises a second set of N data points, wherein N > 2 and each data point in the second set of data points was obtained at a different point in time, and the method further comprises using the most current data point from the second set of N data points to determine whether or not to perform the anomaly detection process using the second set of N data points, wherein each one of N data points included in the second set of N data points is a measure of a second performance metric for the cell, and the second performance metric for the cell is different than the first performance metric for the cell.
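Where a cell has two performance metrics, the trigger decision can be made independently per metric. The sketch below assumes a per-metric configuration table; the `METRIC_CONFIG` structure, the metric descriptions, and the thresholds are hypothetical.

```python
import operator

# Hypothetical per-metric configuration: comparison direction and threshold.
METRIC_CONFIG = {
    "PM-1": {"description": "average latency (ms)",
             "compare": operator.gt, "threshold": 100.0},
    "PM-2": {"description": "average throughput (Mbps)",
             "compare": operator.lt, "threshold": 5.0},
}


def metrics_to_check(latest_by_metric):
    """Return the metrics whose most current data point satisfies its own
    triggering condition, i.e. those for which the AD process should run."""
    selected = []
    for metric, value in latest_by_metric.items():
        cfg = METRIC_CONFIG[metric]
        if cfg["compare"](value, cfg["threshold"]):
            selected.append(metric)
    return selected


# Hypothetical latest values for one cell: latency normal, throughput low.
print(metrics_to_check({"PM-1": 80.0, "PM-2": 3.0}))  # ['PM-2']
```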

[0036] FIG. 4 is a block diagram of network management node 104, according to some embodiments, for performing network node methods disclosed herein. As shown in FIG. 4, network node 104 may comprise: processing circuitry (PC) 402, which may include one or more processors (P) 455 (e.g., one or more general purpose microprocessors and/or one or more other processors, such as an application specific integrated circuit (ASIC), field-programmable gate arrays (FPGAs), and the like), which processors may be co-located in a single housing or in a single data center or may be geographically distributed (i.e., network node 104 may be a distributed computing apparatus where some functions are performed in one location and other functions are performed in another location); at least one network interface 448 comprising a transmitter (Tx) 445 and a receiver (Rx) 447 for enabling network node 104 to transmit data to and receive data from other nodes connected to a network 110 (e.g., an Internet Protocol (IP) network) to which network interface 448 is connected; and a local storage unit (a.k.a., “data storage system”) 408, which may include one or more nonvolatile storage devices and/or one or more volatile storage devices. In embodiments where PC 402 includes a programmable processor, a computer readable medium (CRM) 442 may be provided and store a computer program (CP) 443 comprising computer readable instructions (CRI) 444. CRM 442 may be a non-transitory computer readable medium, such as magnetic media (e.g., a hard disk), optical media, memory devices (e.g., random access memory, flash memory), and the like. In some embodiments, the CRI 444 of computer program 443 is configured such that when executed by PC 402, the CRI causes network node 104 to perform steps described herein (e.g., steps described herein with reference to the flow charts). In other embodiments, network node 104 may be configured to perform steps described herein without the need for code. That is, for example, PC 402 may consist merely of one or more ASICs. Hence, the features of the embodiments described herein may be implemented in hardware and/or software.

[0037] Conclusion

[0038] By employing the “AD trigger function,” the AD process is triggered less often without missing any anomalies. That is, the embodiments have the same performance in anomaly detection as a conventional method, but the embodiments use fewer computation resources. The saved computation resources could be used to process more cells or for other purposes or not used at all, thereby reducing energy consumption.

[0039] Results

[0040] Table 2 below shows the benchmark results over a 7-day period with and without the smart sampling approach described above.

TABLE 2

[0041] While various embodiments are described herein, it should be understood that they have been presented by way of example only, and not limitation. Thus, the breadth and scope of this disclosure should not be limited by any of the above-described exemplary embodiments. Moreover, any combination of the above-described elements in all possible variations thereof is encompassed by the disclosure unless otherwise indicated herein or otherwise clearly contradicted by context.

[0042] Additionally, while the processes described above and illustrated in the drawings are shown as a sequence of steps, this was done solely for the sake of illustration. Accordingly, it is contemplated that some steps may be added, some steps may be omitted, the order of the steps may be re-arranged, and some steps may be performed in parallel.

Claims

CLAIMS:

1. A method (300) for anomaly detection, the method comprising: storing (s302) time series data, the stored time series data comprising a first set of N data points, wherein N > 2 and each data point in the first set of data points was obtained at a different point in time; and using (s304) the most current data point from the first set of N data points to determine whether or not to perform an anomaly detection process using at least N-1 of the N data points.

2. The method of claim 1, wherein using the most current data point to determine whether or not to perform the anomaly detection process comprises comparing the data point to a threshold.

3. The method of claim 2, wherein using the most current data point to determine whether or not to perform the anomaly detection process further comprises determining (s204), based on the comparing, that a condition is satisfied.

4. The method of claim 3, further comprising, as a result of determining that the condition is satisfied, performing (s206) the anomaly detection process using at least N-1 of the N data points.

5. The method of claim 4, further comprising: after performing the anomaly detection process, obtaining (s208) a new current data point; and using the new current data point to determine whether or not to perform the anomaly detection process using the new current data point and N-1 of the most recent data points included in the first set of N data points.

6. The method of claim 5, further comprising performing the anomaly detection process using the new current data point and N-1 of the most recent data points included in the first set of N data points.

7. The method of claim 3, further comprising, as a result of determining that the condition is not satisfied: refraining from performing the anomaly detection process; collecting (s210) a new set of N data points; and using the most current data point from the new set of N data points to determine whether or not to perform the anomaly detection process using at least N-1 of the new set of N data points.

8. The method of any one of claims 1-7, wherein the first set of N data points is associated with a cell of a mobile communication network, and each one of N data points included in the first set of N data points is a measure of a first performance metric for the cell.

9. The method of claim 8, wherein the first performance metric is: a throughput of the cell, a latency associated with the cell, or a downtime for the cell.

10. The method of claim 8 or 9, wherein the time series data further comprises a second set of N data points, wherein N > 2 and each data point in the second set of data points was obtained at a different point in time, and the method further comprises using the most current data point from the second set of N data points to determine whether or not to perform the anomaly detection process using the second set of N data points, wherein each one of N data points included in the second set of N data points is a measure of a second performance metric for the cell, and the second performance metric for the cell is different than the first performance metric for the cell.

11. A computer program (443) comprising instructions (444) which when executed by processing circuitry (402) of a network management node (104) causes the network management node to perform the method of any one of claims 1-10.

12. A network management node, the network management node being configured to: store time series data, the stored time series data comprising a first set of N data points, wherein N > 2 and each data point in the first set of data points was obtained at a different point in time; and use the most current data point from the first set of N data points to determine whether or not to perform an anomaly detection process using at least N-1 of the N data points.

13. The network management node of claim 12, wherein using the most current data point to determine whether or not to perform the anomaly detection process comprises comparing the data point to a threshold.

14. The network management node of claim 13, wherein using the most current data point to determine whether or not to perform the anomaly detection process further comprises determining, based on the comparing, that a condition is satisfied.

15. The network management node of claim 14, wherein the network management node is further configured to, as a result of determining that the condition is satisfied, perform the anomaly detection process using at least N-1 of the N data points.

16. The network management node of claim 15, wherein the network management node is further configured to: after performing the anomaly detection process, obtain a new current data point; and use the new current data point to determine whether or not to perform the anomaly detection process using the new current data point and N-1 of the most recent data points included in the first set of N data points.

17. The network management node of claim 16, wherein the network management node is further configured to perform the anomaly detection process using the new current data point and N-1 of the most recent data points included in the first set of N data points.

18. The network management node of claim 14, wherein the network management node is further configured to, as a result of determining that the condition is not satisfied: refrain from performing the anomaly detection process; collect a new set of N data points; and use the most current data point from the new set of N data points to determine whether or not to perform the anomaly detection process using at least N-1 of the new set of N data points.

19. The network management node of any one of claims 12-18, wherein the first set of N data points is associated with a cell of a mobile communication network, and each one of N data points included in the first set of N data points is a measure of a first performance metric for the cell.

20. The network management node of claim 19, wherein the first performance metric is: a throughput of the cell, a latency associated with the cell, or a downtime for the cell.

21. The network management node of claim 19 or 20, wherein the time series data further comprises a second set of N data points, wherein N > 2 and each data point in the second set of data points was obtained at a different point in time, and the method further comprises using the most current data point from the second set of N data points to determine whether or not to perform the anomaly detection process using the second set of N data points, wherein each one of N data points included in the second set of N data points is a measure of a second performance metric for the cell, and the second performance metric for the cell is different than the first performance metric for the cell.

22. A network management node (104) comprising: processing circuitry (402); and a memory (442), the memory containing instructions (444) executable by the processing circuitry, whereby the network management node is operative to perform the method of any one of claims 1-10.