WO2021249629A1

WO2021249629A1 - Device and method for monitoring communication networks

Info

Publication number: WO2021249629A1
Application number: PCT/EP2020/065990
Authority: WO
Inventors: Cristian-Alexandru Olariu; MingXue Wang; Peng Hu; Hitham Ahmed Assem Aly SALAMA
Original assignee: Huawei Technologies Co., Ltd.
Priority date: 2020-06-09
Filing date: 2020-06-09
Publication date: 2021-12-16

Abstract

The present disclosure relates to a device for monitoring a communication network. The device obtains data including topology information, wherein the topology information is indicative of a plurality of entities of the communication network and one or more interactions between some or all of the plurality of entities. The device further obtains a plurality of alarms, wherein each alarm is associated with at least one of the plurality of entities, and correlates the plurality of alarms into one or more groups of alarms, wherein each group of alarms is associated with a subset of the data that includes a subset of the topology information. Moreover, the device identifies one or more incidents from the one or more groups of alarms, based on an estimation of a root cause probability for each group of alarms according to its associated subset of the data.

Description

DEVICE AND METHOD FOR MONITORING COMMUNICATION NETWORKS

TECHNICAL FIELD

The present disclosure relates generally to communication networks, and particularly to monitoring communication networks. To this end, a device and a method for monitoring a communication network are disclosed. For example, the disclosed device and method may support performing a Root Cause Analysis (RCA), and/or identifying an incident or a root cause of a problem.

BACKGROUND

Generally, communication networks (e.g., telecommunication networks) include many components running in a complex environment. Moreover, communication networks are vulnerable to problems (such as faults and/or incidents) that may occur, for example, due to hardware or software configurations, or changes in the communication networks, etc.

Conventional devices and methods for identifying incidents or performing RCA are based on monitoring a performance and health of large-scale distributed heterogeneous computing systems at various locations (e.g., physical machine logs, software stack traces, etc.). The monitoring process may require data (such as numerical data, textual data, etc.) from the whole system. Further, the collected data may be used to extract insights into the health state of the system, and this is achieved mainly through raising alarms when the system behaves differently than expected.

The number of alarms raised by such a system may fall under the category of big data. It is generally desirable to improve the process of finding an incident or root cause in a vast amount of alarms generated at a big data scale.

SUMMARY

In view of the above-mentioned problems and disadvantages, embodiments of the present disclosure aim to improve conventional devices and methods for monitoring a communication network. An objective is to provide a device and a method that can identify an incident or a root cause of a problem in the network. Another objective is to provide a device and method that can obtain a dataset from the communication network and use it for efficiently identifying an incident or a root cause of a problem in the communication network. Another objective is to provide a device and method that can provide, as an output, an identified incident or perform an RCA for a problem.

The above mentioned one or more objectives are achieved by the embodiments of the disclosure as described in the enclosed independent claims. Advantageous implementations of the embodiments of the disclosure are further defined in the dependent claims.

A first aspect of the present disclosure provides a device for monitoring a communication network, the device being configured to obtain data including topology information, wherein the topology information is indicative of a plurality of entities of the communication network and one or more interactions between some or all of the plurality of entities. The device is further configured to obtain a plurality of alarms, wherein each alarm is associated with at least one of the plurality of entities, correlate the plurality of alarms into one or more groups of alarms, wherein each group of alarms is associated with a subset of the data that includes a subset of the topology information, and identify one or more incidents from the one or more groups of alarms, based on an estimation of a root cause probability for each group of alarms according to its associated subset of the data.

For example, the estimation of the root cause probability of a given group of alarms represents a likelihood that an entity associated with an alarm in that given group of alarms is a root cause of an incident in the communication network.

The device may be, or may be incorporated in, an electronic device such as a computer, a personal computer (PC), a tablet, a laptop, a network entity, a server computer, a client device, etc.

The communication network may comprise the plurality of entities that may interact with each other. The plurality of entities of the communication network may comprise any network entity, such as a physical entity or a logical entity, or a network node, or a network element of the communication network. For instance, a physical entity may be a server, a router, or a switch in the communication network. A logical entity may be a logically separate entity with a well-defined functionality in the communication network, like a network function. Moreover, the device may obtain the data including the topology information related to the plurality of entities and the interactions between entities. For example, the data may be obtained directly from the communication network, or it may be obtained indirectly from a monitoring system. Furthermore, for example, the plurality of alarms may be obtained by the device based on an alarming system flagging abnormal behaviors in the communication network, or by the monitoring system capturing information about the alarms and their associated entities in the communication network.

The device of the first aspect may perform a root cause analysis (e.g., a temporal graph-based root cause analysis) to detect an entity, which is the cause of the abnormal event that triggered a chain of alarms (e.g., included in the plurality of alarms) in the communication network.

In some embodiments, the device may identify an incident (e.g., identify an entity responsible for a fault) by leveraging interaction events logged across the plurality of entities of the communication network. In some embodiments, the device may store obtained data in a graph, and perform the RCA process on this kind of data embedding structures or the graphs.

In some embodiments, the device may include, in the RCA, entities that did not raise any alarm, but may act as dependency links between active problematic entities.

The device of the first aspect may support Site Reliability Engineers (SREs) in starting an investigation into the root cause of an incident. For example, the device may estimate root cause probabilities, in order to identify entities (e.g., network nodes of the communication network) that are more likely to be the root cause of an incident or the root cause of a problem.

In the following, the terms “entity” and “node” are used interchangeably, without limiting the present disclosure.

In an implementation form of the first aspect, the device is further configured to determine one or more subsets of the data based on the one or more groups of alarms and the obtained data.

In a further implementation form of the first aspect, the device is further configured to obtain hierarchical-structured data based on the obtained data according to one or more criteria, wherein the plurality of entities in the hierarchical-structured data have hierarchical dependency relationships, wherein the hierarchical-structured data comprises a plurality of links, and wherein each link represents one or more hierarchical dependency relationships between the plurality of entities of the communication network.

For example, the device may obtain topology information that may indicate the entities (nodes, such as logical nodes or physical nodes) and their interactions. Such interactions are usually done via interfaces. As an example of the hierarchical-structured data, the device may obtain a graph, in which the connections are represented using directed edges, so as to capture the bidirectionality of the interactions. Further, the raised alarms (which may contain timestamps) may be associated directly or indirectly with one of the entities (nodes) in the graph. Moreover, the device may group the alarms using temporal (e.g., using the timestamps) and topological information.

In a further implementation form of the first aspect, the device is further configured to obtain one or more subsets of the hierarchical-structured data, based on the hierarchical-structured data and the one or more groups of alarms, wherein each subset of the hierarchical-structured data comprises a first entity having at least one hierarchical dependency relationship to at least one other entity, a second entity having no hierarchical dependency relationship to another entity, and a third entity located between the first entity and the second entity.

Each subset of the hierarchical-structured data may be, for example, a sub-graph.

In some embodiments, the groups of alarms may affect a subset of the graph, and the device may use that sub-graph to perform the estimation of the root cause probability. For example, the sub-graphs may be processed to extract roots, leaves and nodes in-between.

In the following, when referring to the hierarchical-structured data or graphs or sub-graphs, the following terms are used interchangeably, without limiting the present disclosure:

• “root entity” and “first entity”

• “leaf entity” and “second entity”

• “intermediary entity” and “third entity”, and

• “leaf probability” and “interval probability”.

In a further implementation form of the first aspect, the device is further configured to determine a number of alarms associated with each of the first entity, the second entity, and the third entity of at least one subset of the hierarchical-structured data, based on the one or more groups of alarms.

For example, each of the entities may be associated with a number of alarms. Further, for each leaf entity, the sub-graph may be traversed from all root entities to that leaf entity, across all possible paths. Moreover, the device may accumulate the number of alarms, which can be considered as affected by that leaf entity. This process may be repeated for all leaf entities, and, at the end, the sums are normalized and root cause probabilities may be extracted from the normalized sums.
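The traversal and normalization described above can be sketched as follows. This is a minimal illustration in Python, not the disclosed implementation: the function name `leaf_probabilities`, the edge direction (from a dependent entity to the entity it depends on), the entity names, and the choice to count each entity only once per leaf even when it lies on several paths are all assumptions made for the example.

```python
from collections import defaultdict


def leaf_probabilities(edges, alarm_counts):
    """Estimate a normalized score per leaf entity of an acyclic sub-graph.

    edges: iterable of (dependent, dependency) pairs (assumed direction).
    alarm_counts: dict mapping entity name -> number of alarms.
    """
    children = defaultdict(set)
    has_parent = set()
    nodes = set(alarm_counts)
    for src, dst in edges:
        children[src].add(dst)
        has_parent.add(dst)
        nodes.update((src, dst))

    # Root entities have no incoming link in the sub-graph.
    roots = [n for n in nodes if n not in has_parent]
    reached = defaultdict(set)  # leaf -> entities on any root-to-leaf path

    def walk(node, path):
        path = path + [node]
        nexts = children.get(node, set())
        if not nexts:  # leaf entity: record every entity on this path
            reached[node].update(path)
            return
        for nxt in nexts:
            if nxt not in path:  # guard against accidental cycles
                walk(nxt, path)

    for root in roots:
        walk(root, [])

    # Accumulate alarms per leaf, then normalize the sums.
    sums = {leaf: sum(alarm_counts.get(n, 0) for n in members)
            for leaf, members in reached.items()}
    total = sum(sums.values())
    return {leaf: (s / total if total else 0.0) for leaf, s in sums.items()}


# Hypothetical sub-graph: one root, two intermediaries, two leaves.
example_edges = [("gateway", "orders"), ("gateway", "billing"),
                 ("orders", "db"), ("billing", "db"), ("billing", "cache")]
example_counts = {"gateway": 2, "orders": 3, "billing": 1, "db": 4, "cache": 0}
example_probs = leaf_probabilities(example_edges, example_counts)
```

In this hypothetical example, the leaf "db" accumulates 10 alarms and the leaf "cache" accumulates 3, so the normalized sums are 10/13 and 3/13, respectively.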

In a further implementation form of the first aspect, the device is further configured to obtain an additional alarm associated with at least one of the plurality of entities, and add the additional alarm to a group from the one or more groups of alarms or to a new group.

In a further implementation form of the first aspect, the device is further configured to adjust, when adding the additional alarm to the group from the one or more groups of alarms, the estimation of the root cause probability for that group.

For example, the device may include information related to the temporality of incident build up and may further adjust the root cause probabilities with each newly added alarm to a group of alarms.

In a further implementation form of the first aspect, the device is further configured to correlate the plurality of alarms into the one or more groups of alarms based further on temporal information of the communication network.

In a further implementation form of the first aspect, the device is further configured to determine, based on the temporal information of the communication network, one or more time intervals, at which the plurality of alarms are generated.

In a further implementation form of the first aspect, the device is further configured to estimate, for each of the time intervals, an interval probability by determining one or more possible paths connecting the second entity and the first entity in the subset of the hierarchical-structured data, and determining alarms associated with entities arranged in the one or more possible paths.

In a further implementation form of the first aspect, the device is further configured to estimate the root cause probability for each group associated to a subset of the hierarchical-structured data based on the determined interval probabilities of that subset of the hierarchical-structured data.

In a further implementation form of the first aspect, the device is further configured to estimate, based on the temporal information of the communication network, a prior root cause probability for at least one entity in at least one subset of the hierarchical-structured data, and determine, based on the estimated prior root cause probability, a root cause probability for the at least one entity in a current group to be a root cause of an incident.

In a further implementation form of the first aspect, the device is further configured to determine, based on the temporal information of the communication network, a temporal weighting function for the one or more groups of alarms, and apply the determined temporal weighting function to the estimation of the root cause probability of at least one group of alarms.
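The form of the temporal weighting function is not specified at this point of the description. Purely as an illustration, the following Python sketch assumes an exponential decay, so that older time intervals contribute less to the combined root cause probability. The function name `combine_interval_probabilities`, the `decay` parameter, and the weighted-sum combination rule are assumptions of this example and are not taken from the disclosure.

```python
import math


def combine_interval_probabilities(interval_probs, ages, decay=0.1):
    """Combine per-interval probabilities into one estimate per entity.

    interval_probs: list of dicts, one per time interval (entity -> probability).
    ages: list of interval ages (e.g., seconds since the interval ended).
    decay: assumed decay rate of the hypothetical weighting function.
    """
    weights = [math.exp(-decay * age) for age in ages]
    combined = {}
    for probs, weight in zip(interval_probs, weights):
        for entity, p in probs.items():
            combined[entity] = combined.get(entity, 0.0) + weight * p
    total = sum(combined.values())
    if not total:
        return combined
    # Re-normalize so that the weighted estimates sum to one.
    return {entity: value / total for entity, value in combined.items()}
```

With equal ages the weights are equal and the result reduces to a plain average of the interval probabilities; a larger `decay` shifts the estimate toward the most recent interval.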

A second aspect of the disclosure provides a method for monitoring a communication network, the method comprising obtaining data including topology information, wherein the topology information is indicative of a plurality of entities of the communication network and one or more interactions between some or all of the plurality of entities, obtaining a plurality of alarms, wherein each alarm is associated with at least one of the plurality of entities, correlating the plurality of alarms into one or more groups of alarms, wherein each group of alarms is associated with a subset of the data that includes a subset of the topology information, and identifying one or more incidents from the one or more groups of alarms, based on an estimation of a root cause probability for each group of alarms according to its associated subset of the data.

For example, the estimation of the root cause probability of a given group of alarms represents a likelihood that an entity associated with an alarm in that given group of alarms is a root cause of an incident in the communication network.

In an implementation form of the second aspect, the method further comprises determining one or more subsets of the data based on the one or more groups of alarms and the obtained data.

In a further implementation form of the second aspect, the method further comprises obtaining hierarchical-structured data based on the obtained data according to one or more criteria, wherein the plurality of entities in the hierarchical-structured data have hierarchical dependency relationships, wherein the hierarchical-structured data comprises a plurality of links, and wherein each link represents one or more hierarchical dependency relationships between the plurality of entities of the communication network.

In a further implementation form of the second aspect, the method further comprises obtaining one or more subsets of the hierarchical-structured data, based on the hierarchical-structured data and the one or more groups of alarms, wherein each subset of the hierarchical-structured data comprises a first entity having at least one hierarchical dependency relationship to at least one other entity, a second entity having no hierarchical dependency relationship to another entity, and a third entity located between the first entity and the second entity.

In a further implementation form of the second aspect, the method further comprises determining a number of alarms associated with each of the first entity, the second entity, and the third entity of at least one subset of the hierarchical-structured data, based on the one or more groups of alarms.

In a further implementation form of the second aspect, the method further comprises obtaining an additional alarm associated with at least one of the plurality of entities, and adding the additional alarm to a group from the one or more groups of alarms or to a new group.

In a further implementation form of the second aspect, the method further comprises adjusting, when adding the additional alarm to the group from the one or more groups of alarms, the estimation of the root cause probability for that group.

In a further implementation form of the second aspect, the method further comprises correlating the plurality of alarms into the one or more groups of alarms based further on temporal information of the communication network.

In a further implementation form of the second aspect, the method further comprises determining, based on the temporal information of the communication network, one or more time intervals, at which the plurality of alarms are generated.

In a further implementation form of the second aspect, the method further comprises estimating, for each of the time intervals, an interval probability by determining one or more possible paths connecting the second entity and the first entity in the subset of the hierarchical-structured data, and determining alarms associated with entities arranged in the one or more possible paths.

In a further implementation form of the second aspect, the method further comprises estimating the root cause probability for each group associated to a subset of the hierarchical-structured data based on the determined interval probabilities of that subset of the hierarchical-structured data.

In a further implementation form of the second aspect, the method further comprises estimating, based on the temporal information of the communication network, a prior root cause probability for at least one entity in at least one subset of the hierarchical-structured data, and determining, based on the estimated prior root cause probability, a root cause probability for the at least one entity in a current group to be a root cause of an incident.

In a further implementation form of the second aspect, the method further comprises determining, based on the temporal information of the communication network, a temporal weighting function for the one or more groups of alarms, and applying the determined temporal weighting function to the estimation of the root cause probability of at least one group of alarms.

A third aspect of the present disclosure provides a computer program comprising a program code for performing the method according to the second aspect or any of its implementation forms.

A fourth aspect of the present disclosure provides a non-transitory storage medium storing executable program code which, when executed by a processor, causes the method according to the second aspect or any of its implementation forms to be performed.

It has to be noted that all devices, elements, units and means described in the present application could be implemented in the software or hardware elements or any kind of combination thereof. All steps which are performed by the various entities described in the present application as well as the functionalities described to be performed by the various entities are intended to mean that the respective entity is adapted to or configured to perform the respective steps and functionalities. Even if, in the following description of specific embodiments, a specific functionality or step to be performed by external entities is not reflected in the description of a specific detailed element of that entity which performs that specific step or functionality, it should be clear for a skilled person that these methods and functionalities can be implemented in respective software or hardware elements, or any kind of combination thereof.

BRIEF DESCRIPTION OF DRAWINGS

The above described aspects and implementation forms will be explained in the following description of specific embodiments in relation to the enclosed drawings, in which

FIG. 1 depicts a schematic view of a device for monitoring a communication network, according to an embodiment of the disclosure;

FIG. 2 depicts a schematic view of a flowchart of a procedure for identifying an incident based on an estimation of a root cause probability;

FIG. 3 depicts a schematic view of a diagram illustrating an example of obtained hierarchical-structured data;

FIG. 4 depicts a schematic view of a diagram illustrating an example of a subset of the hierarchical-structured data obtained based on a group of alarms;

FIG. 5A-5B depict schematic views of a numerical example used for obtaining the interval probabilities (FIG. 5A), and a subgraph (FIG. 5B);

FIG. 6A-6B depict schematic views of diagrams illustrating connected leaf entities and a single entity and leaf connected;

FIG. 7 depicts a schematic view of a diagram illustrating an example of applying a temporal weighting function to the estimation of the root cause probability of a group of alarms;

FIG. 8 depicts a schematic view of a diagram illustrating the device identifying an incident in an incident management system; and

FIG. 9 depicts a schematic view of a flowchart of a method for monitoring a communication network, according to an embodiment of the disclosure.

DETAILED DESCRIPTION OF EMBODIMENTS

FIG. 1 depicts a schematic view of a device 100 for monitoring a communication network 1 according to an embodiment of the invention.

The device 100 may be, or may be incorporated in, an electronic device, for example, a computer, a laptop, a network entity, etc.

The device 100 is configured to obtain data 110 including topology information 111. The topology information 111 is indicative of a plurality of entities of the communication network 1. The plurality of entities of the communication network 1 may comprise physical entities and/or logical entities, without limiting the present disclosure in that regard. The topology information 111 is further indicative of one or more interactions between some or all of the plurality of entities. The interactions may comprise any interaction that occurs between the entities of the communication network 1. For example, the device 100 may obtain the data 110, and may further parse each interaction that occurred in the communication network 1.

The device 100 is further configured to obtain a plurality of alarms 121, 122, 123. Each alarm 121, 122, 123 is associated with at least one of the plurality of entities. For example, an entity of the communication network may raise an alarm (e.g., the alarm may be raised when the communication network behaves differently than expected). Moreover, the raised alarm may be associated with the entity that raised the alarm. In some embodiments, a raised alarm may be associated with more than one entity. For example, a raised alarm may be associated, directly, with an entity that raised the alarm, and it may further be associated, indirectly, with an entity that did not raise the alarm.

The device 100 is further configured to correlate the plurality of alarms 121, 122, 123 into one or more groups of alarms 131, 132. Moreover, each group of alarms 131, 132 may be associated with a subset of the data that includes a subset of the topology information.

Furthermore, the device 100 is configured to identify one or more incidents 141 from the one or more groups of alarms 131, 132, based on an estimation of a root cause probability for each group of alarms 131, 132 according to its associated subset of the data.

For example, the alarms 122 and 123 may be correlated to one group of alarms 132. Moreover, the alarm 122 may be associated with a subset of data from the obtained data 110 that is affected by the alarm 122. Furthermore, the alarm 123 may be associated with a subset of data from the obtained data 110 that is affected by the alarm 123. Moreover, the subset of data including the topology information associated with the alarms 122 and 123 may be used for the estimation of the root cause probability of the group of alarms 132.

The estimation of the root cause probability of a given group of alarms 131, 132 represents a likelihood that an entity associated with an alarm in that given group of alarms is a root cause of an incident in the communication network 1.

Hence, the device 100 may be able to identify an incident from the one or more groups of alarms 131, 132. For example, the device 100 may optionally have a decision unit 140, which may identify the incident.

The device 100 may comprise processing circuitry (not shown in FIG. 1) configured to perform, conduct or initiate the various operations of the device 100 described herein. The processing circuitry may comprise hardware and software. The hardware may comprise analog circuitry or digital circuitry, or both analog and digital circuitry. The digital circuitry may comprise components such as application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), or multi-purpose processors. In one embodiment, the processing circuitry comprises one or more processors and a non-transitory memory connected to the one or more processors. The non-transitory memory may carry executable program code which, when executed by the one or more processors, causes the device 100 to perform, conduct or initiate the operations or methods described herein.

FIG. 2 depicts a schematic view of a flowchart of a procedure 200 for identifying an incident 141 based on an estimation of a root cause probability.

In FIG. 2, it is assumed that the plurality of entities of the communication network are based on nodes that are forming the communication network, and each node of the communication network has a name (e.g., an identifier). Further, the interactions between the nodes are performed via interfaces that are referred to as edges, without limiting the present disclosure in that regard.

At step S201, the device 100 obtains data 110 and parses interactions between the plurality of entities.

For example, the nodes of the communication network 1 interact with each other. The interactions may comprise sending or receiving a request in the communication network 1. Further, the device may parse each interaction event that happens in the communication network 1. Parsing the interactions may comprise detecting the nodes (e.g., deriving names of nodes involved in the interactions) that are the originator and the receiver of a request.

Furthermore, each request may be done via an interface, and there may be multiple interfaces between the same two nodes. The interfaces may comprise any known interface, and may have a name (e.g., an identifier such as i1, i2, etc.). Moreover, the device 100 may use the names of the nodes and the names of the interfaces to build a topology of the communication network (e.g., at step S202). For example, the topology may be a graph-representation of the communication network, in which the interfaces (e.g., the names of the interfaces) are used as the edges, in order to represent the interactions between the nodes.

At step S202, the device 100 obtains hierarchical-structured data 310. For example, the hierarchical-structured data 310 is graph-structured data that is referred to as G in FIG. 2. G may be a master graph or a master topology that indicates the overall topology of the communication network 1.

For instance, the device 100 may initialize a procedure comprising producing an empty graph. Further, as new interaction events are processed during step S201, the device 100 adds the names of the nodes and the names of the interfaces as edges to the graph G. For example, the device 100 may build a representation of the topology of the communication network based on the interactions observed in the event logs across the monitored period. The representation of the topology may be static and may not consider the temporality aspects or the time information of the occurred interactions in the graph G.

Moreover, as the interactions occur generally via named interfaces, there may be multiple directional connections from a first entity (e.g., a node A) to a second entity (e.g., a node B), and each of these directional connections may be assigned a different name.

A directional connection is shown as an arrow having a direction and being outbound from the first entity and being inbound to the second entity.

Hence, the device 100 obtains the hierarchical-structured data 310 that may be referred to as master graph or master topology graph G, which is a multi-directional graph representing the overall topology of the communication network. An example of graph G is depicted in FIG. 3.
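The construction of the master graph G from parsed interaction events can be sketched as follows. This is a minimal Python illustration under stated assumptions: the event format (originator, receiver, interface name) and the function name `build_master_graph` are hypothetical, and a nested dictionary stands in for whatever graph structure an actual implementation would use.

```python
def build_master_graph(events):
    """Build a static multi-directed topology from interaction events.

    events: iterable of (originator, receiver, interface_name) tuples
            (assumed format of a parsed interaction event).
    Returns (nodes, graph), where graph[originator][receiver] is the set
    of interface names used from originator to receiver.
    """
    nodes = set()
    graph = {}  # start from an empty graph
    for originator, receiver, interface in events:
        nodes.update((originator, receiver))
        # Several named interfaces may connect the same ordered pair of nodes,
        # so each directed pair keeps a set of interface names.
        graph.setdefault(originator, {}).setdefault(receiver, set()).add(interface)
    return nodes, graph


# Hypothetical event log: two interfaces from A to B, one from B to A.
example_events = [("A", "B", "i1"), ("A", "B", "i2"), ("B", "A", "i3"),
                  ("A", "B", "i1")]
example_nodes, example_graph = build_master_graph(example_events)
```

Repeated events over the same interface do not add new edges, which matches the static nature of the representation: timestamps of the interactions are not stored in G.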

Reference is now made to FIG. 3, which is a schematic view of an example of obtained hierarchical-structured data 310.

As discussed, the device 100 obtains hierarchical-structured data 310 based on the obtained data 110. The hierarchical-structured data 310 shown in FIG. 3 is indicative of the overall topology of the communication network. For example, the device may perform a mapping procedure for obtaining the hierarchical-structured data 310.

Next, an exemplary mapping procedure is discussed for obtaining the hierarchical-structured data that is a representation of the communication network. The central node in FIG. 3 acts as a root entity for the whole communication network. The plurality of entities of the communication network are based on a root entity, external entities, and internal entities. An external entity is an entity that requests resources. An internal entity is an entity that provides resources. Further, the root entity (the central node shown in FIG. 3) represents the main contact point for external entities that need access to resources managed by the internal entities, depicted as outbound arrows (radiating out) from the root node. In the hierarchical-structured data 310, each arrow (line or link) depicts a dependency: the arrow points out from the dependent entity to another entity, which fulfills requests on behalf of its parent entity (the entity that initiated a request for the resources). Examples of such architectures are content-delivery networks, micro-service platforms, web-services, basic routing networks, etc.

At step S203, the device 100 correlates the plurality of alarms 121, 122, 123 into groups of alarms 131, 132. For example, the plurality of alarms may be grouped based on temporal information (e.g., using timestamps) and topological information (e.g., the hierarchical-structured data 310 constructed from the interactions between the entities at step S202).
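The grouping criterion is not detailed further at this step. As one possible illustration, the following Python sketch assumes that an alarm joins an existing group when it arrives within a fixed time window of the group's latest alarm and its entity is the same as, or directly linked to, an entity already in that group; otherwise a new group is opened. The function name `correlate_alarms`, the `window` parameter, and the adjacency rule are assumptions of this example.

```python
def correlate_alarms(alarms, neighbors, window=60.0):
    """Group alarms using temporal and topological proximity.

    alarms: list of (timestamp, entity) tuples.
    neighbors: dict mapping entity -> set of directly linked entities
               (an undirected view of the topology, assumed here).
    window: assumed maximum gap between consecutive alarms of one group.
    """
    groups = []
    for timestamp, entity in sorted(alarms):
        target = None
        for group in groups:
            close_in_time = timestamp - group["last"] <= window
            linked = (entity in group["entities"]
                      or bool(neighbors.get(entity, set()) & group["entities"]))
            if close_in_time and linked:
                target = group
                break
        if target is None:  # no suitable group: start a new one
            target = {"entities": set(), "alarms": [], "last": timestamp}
            groups.append(target)
        target["entities"].add(entity)
        target["alarms"].append((timestamp, entity))
        target["last"] = timestamp
    return groups
```

Because alarms are processed one at a time, the same routine also illustrates how an additional alarm may be added either to an existing group or to a new group.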

Each group of alarms may affect a subset of the hierarchical-structured data 310 (i.e., a subset of the whole graph shown in FIG. 3), and the device 100 uses that affected subset of the hierarchical-structured data (hereinafter referred to as “the sub-graph”) to estimate the root cause probability for that group of alarms.

Furthermore, depending on the estimated root cause probability, the group of alarms may be identified as an incident in the communication network 1. In some embodiments, the device 100 may further determine the root cause of the identified incident.

At step S204, the device 100 builds a sub-graph (referred to as G’ in FIG. 2). For example, the sub-graph G’ is a subset of the hierarchical-structured data 310 that is affected by an alarm.

In some embodiments, the step S204 may be invoked each time a new group of alarms is created. For example, each time that a new group of alarms is created at step S203, the device 100 invokes step S204 to build a sub-graph for that newly created group of alarms. The device 100 may parse the alarms and may extract the timestamp and the entity (i.e., the name of the entity) from each alarm in the incoming stream of alarms. The obtained alarms may indicate the names of the entities in which the alarms originated, or alternatively, the device 100 may use natural text analytics to match the alarms’ context to one of the entities of the communication network (e.g., to one of the entities in the graph G, or n_i ∈ G).

Furthermore, once a match is established, the matching entity may be appended to a list of matching entities, named L. The device 100 may also store the number of times the same entity is matched to one of the alarms. Moreover, the device 100 may obtain (by parsing all alarms in the groups of alarms) a list of all entities and their corresponding alarm count, and timestamps of raising each alarm.
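As a non-limiting sketch, the matching and counting described above may look as follows. It is assumed here that each alarm is a dictionary with a `timestamp` and either an explicit `entity` field or a free-text `text` field; the substring search merely stands in for the natural text analytics, which the disclosure does not specify. The function name `match_entities` and the alarm texts are hypothetical.

```python
from collections import Counter, defaultdict


def match_entities(alarms, known_entities):
    """Return per-entity alarm counts and the timestamps of the matched alarms."""
    counts = Counter()
    raised_at = defaultdict(list)
    # Try longer names first so that a short name does not shadow a longer one.
    by_length = sorted(known_entities, key=len, reverse=True)
    for alarm in alarms:
        name = alarm.get("entity")
        if name not in known_entities:
            # Assumed fallback: naive substring match on the alarm text.
            text = alarm.get("text", "")
            name = next((e for e in by_length if e in text), None)
        if name is None:
            continue  # no entity of the graph G matches this alarm
        counts[name] += 1
        raised_at[name].append(alarm["timestamp"])
    return counts, dict(raised_at)


known = {"tm-behavior", "tm-search"}
alarms = [
    {"entity": "tm-behavior", "timestamp": 1},
    {"text": "timeout while calling tm-search", "timestamp": 2},
    {"entity": "tm-behavior", "timestamp": 3},
    {"text": "disk usage high", "timestamp": 4},
]
counts, raised_at = match_entities(alarms, known)
```

The resulting `counts` corresponds to the list L together with the number of matches per entity, and `raised_at` holds the timestamps at which the alarms of each entity were raised.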

In some embodiments, the groups of alarms 131, 132 may be updated upon the receipt of new alarms, and the step S204 may be triggered during each updating process.

Reference is now made to FIG. 4, which is a schematic view of an example of a subset of the hierarchical-structured data 410.

The subset of the hierarchical-structured data 410 shown in FIG. 4 is a subgraph of the hierarchical-structured data 310 and is obtained after correlating the alarms and obtaining the group of alarms.

For example, the device 100 obtains the subset of the hierarchical-structured data 410, or the sub-graph shown in FIG. 4, by constructing a new graph, named G’, based on the list of entities matched to the alarms, and by using the hierarchical-structured data 310 (master topology graph G) to extract the links and their directions. The resulting sub-graph G’ embeds the entities currently affected by the alarms, and the magnitude of the impact, which is represented by the number of alarms. For example, in the sub-graph G’, “tm-behavior#4” represents four alarms that are associated with the entity having the name “tm-behavior”.
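A minimal sketch of this construction is given below, assuming that the master topology graph G is available as a list of directed (dependent, provider) links and that the alarm counts per matched entity are already known. The function name `build_subgraph` and the links of the example are assumptions; only the entity names and the “name#count” labelling are taken from the description.

```python
def build_subgraph(master_edges, alarm_counts):
    """Keep only the links of G whose two endpoints both raised alarms."""
    nodes = set(alarm_counts)
    edges = [(src, dst) for src, dst in master_edges if src in nodes and dst in nodes]
    # Label each entity with its alarm count, e.g. "tm-behavior#4".
    labels = {name: f"{name}#{count}" for name, count in alarm_counts.items()}
    return nodes, edges, labels


# Hypothetical links of the master graph G.
master_edges = [
    ("tm-behavior", "tm-search"),
    ("tm-search", "ESGE"),
    ("tm-search", "tm-topic"),  # "tm-topic" raised no alarm and is dropped
]
alarm_counts = {"tm-behavior": 4, "tm-search": 6, "ESGE": 2}
nodes, edges, labels = build_subgraph(master_edges, alarm_counts)
```

The directions of the retained links are copied unchanged from G, so the sub-graph preserves the dependency hierarchy of the master graph.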

The sub-graph G’ also depicts the hierarchy of how the incident’s alarms propagated in the communication network. In other words, a given cause propagates from the origins (leaf nodes 412) upstream, and affects other nodes which rely on the services of the leaf nodes (root nodes 411 and intermediary nodes 413). Moreover, in some embodiments, there may be a matched entity that is not connected with other matching entities; such entities are referred to as single entities and are considered as roots in the sub-graph’s hierarchy. For instance, in the hierarchical-structured data 410, the entity having the name of “tm-order#5” represents an extracted node name that does not match any node in the graph G, and has no contribution to the root cause analysis of the entities.

The subgraph shown in FIG. 4 is a subset of the hierarchical-structured data 410 that includes two first entities 411, referred to as “tm-behavior#4” (shown as a root entity having hierarchical dependency relationship to other entities) and “tm-order#5”. The hierarchical-structured data 410 of FIG. 4 further includes two second entities 412 (each shown as a leaf having no hierarchical dependency relationship to another entity) referred to as “ESGE#2” and “tm-operation#6”, and one third entity 413 shown as an intermediary entity and referred to as “tm-search#6”.
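The three kinds of entities may, for instance, be distinguished from the link directions alone, as in the following sketch. The links used in the example are an assumption, since the figure is not reproduced here; the entity names and their roles follow the description of FIG. 4. The function name `classify` is hypothetical.

```python
def classify(nodes, edges):
    """Label each entity of the sub-graph as root, leaf, or intermediary."""
    has_outgoing = {src for src, _ in edges}
    has_incoming = {dst for _, dst in edges}
    roles = {}
    for node in nodes:
        if node not in has_incoming:
            # Nothing depends on this entity; single (isolated) entities
            # also end up here and are treated as roots.
            roles[node] = "root"
        elif node not in has_outgoing:
            roles[node] = "leaf"
        else:
            roles[node] = "intermediary"
    return roles


nodes = {"tm-behavior", "tm-order", "tm-search", "ESGE", "tm-operation"}
# Assumed links (dependent entity, entity it depends on).
edges = [
    ("tm-behavior", "tm-search"),
    ("tm-search", "ESGE"),
    ("tm-search", "tm-operation"),
]
roles = classify(nodes, edges)
```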

At step S205, the device 100 estimates, for each of the time intervals, an interval probability (also referred to as leaf probability in FIG. 2).

For example, the device 100 may determine the time intervals at which the plurality of alarms 121, 122, 123 are generated. Moreover, for each of the time intervals, the device 100 may estimate its respective interval probability by determining the possible paths connecting the second entity 412 and the first entity 411 in the subset of the hierarchical-structured data 410. Furthermore, the device 100 may determine the number of alarms that are associated with the first entity 411, the second entity 412, and the third entity 413 in each possible path.

Reference is now made to FIG. 5A and FIG. 5B, which are schematic views of a numerical example for obtaining the interval probabilities (shown in diagram 500A of FIG. 5A) for the subgraph (the subset of hierarchical-structured data 410) shown in FIG. 5B.

The procedure illustrated in diagram 500A of FIG. 5A may be performed by the device 100 at step S205 of the procedure 200 shown in FIG. 2.

For example, the device 100 may repeat this process for each of the time intervals into which the group of alarms is split. That is, with each newly received alarm, the representation of the communication network in the hierarchical-structured data 310 or the subset of hierarchical-structured data 410 (the sub-graph) may be updated with the entity that is associated with the received alarm, and the following procedure may be repeated.

The sub-graph G’ is a subset of hierarchical-structured data 410 that comprises the entities 411, 412, 413, their multi-directional links, and the number of associated alarms. For example, the entity 411 in the sub-graph 410 has the name “tm-apigw”, and 138 alarms are associated with the entity 411.

The sub-graph G’ 410 shown in FIG. 5B may be parsed by the device 100, as follows:

• First, the device 100 may determine, for each second entity 412 (leaf entity), all the possible paths between that leaf entity and all the first entities 411 (root entities) of the subgraph 410.

• Second, the device 100 may pass each entity in each possible path, and may accumulate the number of alarms that are associated with that passed entity.

• Third, the device 100 may add, only once, the number of alarms associated with the current second entity 412 (leaf node) for which the interval probability (leaf probability) is being calculated.

The result of the above process may yield a number of associated alarms for each second entity 412 (leaf entity). This sum is proportional to the impact of each leaf entity on the overall incident. As can be seen from FIG. 5A and FIG. 5B, the entity referred to as “tm-odp#7” is affected by 76 alarms and the entity referred to as “tm-recommend#18” is affected by 302 alarms. Further, each leaf entity’s sum can be normalized by the grand total sum across all leaf entities, and that may yield a probability estimation for the root cause probability of the entity and/or the root cause probability of the group of alarms. The root cause probability of the entity referred to as “tm-odp#7” is 0.2, and the root cause probability of the entity referred to as “tm-recommend#18” is 0.8.
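The three parsing steps and the normalization may be sketched as follows. The sketch assumes that an entity lying on several root-to-leaf paths is counted once per path, which is one possible reading of the accumulation step; the function names and the small example graph are hypothetical and do not reproduce FIG. 5B. Only the two sums 76 and 302 in the last lines are taken from the description.

```python
def paths(edges, node, leaf, visited=()):
    """Yield every simple path from node to leaf along the directed links."""
    visited = (*visited, node)
    if node == leaf:
        yield visited
        return
    for src, dst in edges:
        if src == node and dst not in visited:
            yield from paths(edges, dst, leaf, visited)


def normalise(sums):
    """Divide each leaf sum by the grand total across all leaves."""
    grand_total = sum(sums.values())
    return {leaf: value / grand_total for leaf, value in sums.items()}


def leaf_sums(edges, counts):
    """Accumulate, per leaf, the alarms found on all root-to-leaf paths."""
    has_incoming = {dst for _, dst in edges}
    has_outgoing = {src for src, _ in edges}
    roots = [n for n in counts if n not in has_incoming]
    leaves = [n for n in counts if n in has_incoming and n not in has_outgoing]
    sums = {}
    for leaf in leaves:
        total = counts[leaf]  # the leaf's own alarms are added only once
        for root in roots:
            for path in paths(edges, root, leaf):
                # Assumption: entities shared by several paths count per path.
                total += sum(counts[n] for n in path[:-1])
        sums[leaf] = total
    return sums


# Hypothetical sub-graph: gw -> svc -> db and gw -> cache.
edges = [("gw", "svc"), ("svc", "db"), ("gw", "cache")]
counts = {"gw": 10, "svc": 5, "db": 3, "cache": 2}
sums = leaf_sums(edges, counts)
probabilities = normalise(sums)

# Normalizing the two sums quoted in the description.
quoted = normalise({"tm-odp": 76, "tm-recommend": 302})
```

With the quoted sums, 76 / 378 and 302 / 378 round to the stated probabilities of 0.2 and 0.8.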

In some embodiments, the device 100 may further perform a procedure for an augmentation of missing entities.

For example, in some embodiments, it may be possible that an actual root cause of an incident has not yet raised any alarm. Furthermore, the device 100 may use the dynamic topology of historical interactions (captured in the hierarchical-structured data or G) to augment the incident graph (the subset of hierarchical-structured data or G’), which is a sub-graph of the whole entity graph G. For example, the device 100 may search the larger graph for determining connections between leaves in G’ that include nodes that did not raise an alarm.
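One conceivable way of performing such a search is sketched below, assuming a breadth-first traversal of G that starts at each alarmed entity and follows dependency links through entities that raised no alarm until another alarmed entity is reached. The bound `max_silent` on the number of silent entities per connection, the function name `augment`, and the links of the example are assumptions of this sketch.

```python
from collections import deque


def augment(master_edges, alarmed, max_silent=2):
    """Find silent entities of G that connect two alarmed entities."""
    children = {}
    for src, dst in master_edges:
        children.setdefault(src, []).append(dst)

    extra_nodes, extra_edges = set(), set()
    for start in alarmed:
        queue = deque([(start,)])
        while queue:
            path = queue.popleft()
            for nxt in children.get(path[-1], []):
                if nxt in path:
                    continue  # avoid cycles
                new_path = path + (nxt,)
                if nxt in alarmed:
                    if len(new_path) > 2:
                        # At least one silent entity lies in between.
                        extra_nodes.update(new_path[1:-1])
                        extra_edges.update(zip(new_path, new_path[1:]))
                elif len(new_path) - 1 <= max_silent:
                    queue.append(new_path)
    return extra_nodes, extra_edges


# Hypothetical links of G; "tm-topic" raised no alarm.
master_edges = [("tm-search", "tm-topic"), ("tm-topic", "tm-order")]
extra_nodes, extra_edges = augment(master_edges, {"tm-search", "tm-order"})
```

The returned entities may then be added to G’ with an alarm count of zero, so that the path-based estimation can traverse them.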

An example is shown in diagram 600A of FIG. 6A and diagram 600B of FIG. 6B, illustrating connected leaf entities (FIG. 6A) and a single entity and leaf connected (FIG. 6B).

In diagram 600A of FIG. 6A, the entity referred to as “tm-search#2” 611 has dependency connections to the entity referred to as “tm-order#1” 612. Hence, any alarm raised by the entity “tm-order#1” 612 may potentially affect the entity “tm-search#2” 611. Therefore, the “tm-order#1” entity’s affected alarm count will include paths from the root to it that traverse the “tm-search#2” entity 611 as well. In the example shown in diagram 600A of FIG. 6A, the entity referred to as “tm-order#1” 612 is suggested as the root cause of an incident.

Moreover, in diagram 600B of FIG. 6B, the entity referred to as “tm-search#6” 611 has no dependency connections to the entity referred to as “tm-order#5” 612. In this case, the connection discussed with respect to FIG. 6A cannot be considered. Therefore, without considering the connection from the “tm-search#6” entity to the “tm-topic#0” entity and further to the “tm-order#1” entity, the entity referred to as “tm-search#6” may be identified as the root cause of the incident.

At step S206 of the procedure 200 shown in FIG. 2, the device 100 may determine a temporal weighting function, and further apply the determined temporal weighting function to the estimation of the root cause probability of a group of alarms.

For example, in some embodiments, alarms may be generated at different times. Moreover, the device 100, when correlating the plurality of alarms, for obtaining groups of alarms, may add these alarms either to an existing incident, or to a new incident. Moreover, in the case that a new alarm is added to an existing incident, the root cause probabilities may further be adjusted.

For instance, in some embodiments, the device 100 may further obtain a state transition matrix (S.T.M.) comprising the states of the generated incidents in the communication network 1. Moreover, when a new alarm is aggregated to an existing incident, a new state of the existing incident is generated, and may be included in the state transition matrix (referred to as S.T.M. in FIG. 2). Furthermore, for each new state, the device 100 may compute the root cause probabilities, and may further compute a timeline of probabilities. Finally, the device 100 may apply the temporal weighting function to this timeline of probabilities.

In some embodiments, the device 100 may learn (e.g., it may derive) the temporal weighting function from historical prior interactions (e.g., typical time information between raised alarms, typical duration time between the first and the last alarm in an incident, etc.). In one implementation, an exponential weighting function may be used; however, generally, the device 100 may use any weighting function.
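As an illustration only, an exponential weighting of a timeline of probabilities may be sketched as follows. The disclosure does not fix how the weights are combined; the sketch assumes a weighted average in which a state recorded at time t receives the weight exp(-decay * (t_latest - t)), so that more recent states contribute more. The direction of the decay, the parameter `decay`, and the function name `weighted_root_cause` are assumptions.

```python
import math


def weighted_root_cause(timeline, decay=0.01):
    """Combine per-state probabilities with exponentially decaying weights.

    timeline is a list of (timestamp, {entity: probability}) pairs, one
    pair per state of the incident.
    """
    latest = max(timestamp for timestamp, _ in timeline)
    totals = {}
    weight_sum = 0.0
    for timestamp, probabilities in timeline:
        weight = math.exp(-decay * (latest - timestamp))
        weight_sum += weight
        for entity, probability in probabilities.items():
            totals[entity] = totals.get(entity, 0.0) + weight * probability
    # Dividing by the sum of weights keeps the result a probability distribution.
    return {entity: value / weight_sum for entity, value in totals.items()}


# Two hypothetical states, 100 seconds apart; the decay halves the weight
# of the older state.
timeline = [
    (0.0, {"a": 1.0, "b": 0.0}),
    (100.0, {"a": 0.0, "b": 1.0}),
]
weighted = weighted_root_cause(timeline, decay=math.log(2) / 100)
```

A weighting learned from historical incidents, as described above, could replace the fixed `decay` parameter without changing the structure of the computation.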

FIG. 7 shows a diagram 700 illustrating an example of applying a temporal weighting function to the estimation of the root cause probability of a group of alarms.

As can be derived from diagram 700 of FIG. 7, for each new state, the device 100 recomputes (estimates again) the root cause probabilities (the estimated probabilities indicated with “Without weighting”), which are further weighted according to the time at which they were generated (the estimated probabilities indicated with “With weighting”).

Furthermore, in some embodiments, the state of entities may evolve over time, and consequently, the root cause probability of each entity may also change with each new state or a new piece of information.

At step S207 of the procedure 200 shown in FIG. 2, the device 100 estimates the root cause probability for each group of alarms 131, 132 and further identifies the incident 141.

The incident 141 is a group of alarms, for which the estimated root cause probability has the highest value.

Moreover, the device 100 may use priority information related to an entity for estimating the root cause probability for that entity in a current group of alarms, or in a current state, to be the root cause of an incident. The priority information may be an estimated prior root cause probability. For example, in some embodiments, the obtained data 110 may include information, collected over time, about the probability that an entity is the root cause of an incident. This information may also be collected from a domain expert. Moreover, the device 100 may augment the root cause probability of an entity based on the current state of the communication network 1 with the a priori root cause probabilities using a probability boosting function. In some embodiments, the device 100 may further train the probability boosting function, when a ground truth is collected about past incidents.

This can also be applied (e.g., by the device 100) as a boosting factor onto the current transition matrix probabilities. For example, it can be applied onto each entity that has historical information (e.g., has a root cause probability) for being a root cause of an incident in the communication network. Moreover, this information may be obtained as independent and/or inter-dependent knowledge.

An example for the independent knowledge case is when the rate of failure of an entity is captured across all failures encountered. Moreover, an example for the inter-dependent knowledge case is when knowledge is captured about the probability of an entity being the root cause, given that it is grouped with certain other nodes in the same incident.
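For the independent knowledge case, a probability boosting function may, for instance, be sketched as follows. The multiplicative form with a subsequent re-normalization, the parameter `strength`, and the function name `boost` are assumptions; the disclosure leaves the form of the boosting function open. Entities without historical information are left unboosted.

```python
def boost(current, priors, strength=1.0):
    """Scale current root cause probabilities by prior knowledge.

    current maps each entity to its probability in the current state;
    priors maps entities with historical information to their prior
    root cause probability.
    """
    boosted = {
        entity: probability * (1.0 + strength * priors.get(entity, 0.0))
        for entity, probability in current.items()
    }
    total = sum(boosted.values())
    # Re-normalize so that the boosted values again sum to one.
    return {entity: value / total for entity, value in boosted.items()}


# Hypothetical values: entity "a" has always been the root cause in the past.
current = {"a": 0.5, "b": 0.5}
priors = {"a": 1.0}
boosted = boost(current, priors)
```

The inter-dependent case could be handled analogously by looking up the prior conditioned on the other entities present in the same incident.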

Reference is now made to FIG. 8, which is a schematic view of a diagram illustrating the device 100 identifying an incident in an incident management system.

In diagram 800 of FIG. 8, the device 100 is, as an example, integrated within a communication network that is an incident management system. The diagram of FIG. 8 represents the integration of the device 100 into the incident management system and the visualization of the information made available by the device 100 to a Site Reliability Engineer, who may use its output.

For example, the device 100 obtains data 110. The data 110 may comprise incident lists, time range intervals at which an incident is active, and a number of alarms associated with the incidents. Moreover, upon selection of one or more alarms, the device 100 obtains information related to the alarms, including the entities associated to the alarms. This information may be represented in a view 810.

Furthermore, the device 100 may obtain the sub-graph G’ 410. The sub-graph G’ 410 may be obtained based on the alarms of the selected incident(s). The sub-graph G’ 410 further shows the number of alarms associated with the entities, and, in the case of a leaf node, the root cause probability for that leaf to be the root cause of the incident.

The device 100 may further identify an incident 141. As an example, for the sake of simplicity, a bar graph is presented that shows the entities sorted by their respective root cause probability of being the root cause of the incident. This exemplary view enhances the human understanding of the scale of the difference between the different proposed root cause entities.

FIG. 9 shows a method 900 for monitoring a communication network according to an embodiment of the disclosure. The method 900 may be carried out by the device 100, as it is described above.

The method 900 comprises a step S901 of obtaining data 110 including topology information 111, wherein the topology information 111 is indicative of a plurality of entities of the communication network 1 and one or more interactions between some or all of the plurality of entities.

The method 900 further comprises a step S902 of obtaining a plurality of alarms 121, 122, 123, wherein each alarm 121, 122, 123 is associated with at least one of the plurality of entities.

The method 900 further comprises a step S903 of correlating the plurality of alarms 121, 122, 123 into one or more groups of alarms 131, 132, wherein each group of alarms 131, 132 is associated with a subset of the data that includes a subset of the topology information.

The method 900 further comprises a step S904 of identifying one or more incidents 141 from the one or more groups of alarms 131, 132, based on an estimation of a root cause probability for each group of alarms 131, 132 according to its associated subset of the data. For example, the estimation of the root cause probability of a given group of alarms 131, 132 represents a likelihood that an entity associated with an alarm in that given group of alarms is a root cause of an incident in the communication network 1.

The present disclosure has been described in conjunction with various embodiments as examples as well as implementations. However, other variations can be understood and effected by those persons skilled in the art and practicing the claimed disclosure, from the studies of the drawings, this disclosure and the independent claims. In the claims as well as in the description the word “comprising” does not exclude other elements or steps and the indefinite article “a” or “an” does not exclude a plurality. A single element or other unit may fulfill the functions of several entities or items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used in an advantageous implementation.

Claims

1. A device (100) for monitoring a communication network (1), the device (100) being configured to: obtain data (110) including topology information (111), wherein the topology information (111) is indicative of a plurality of entities of the communication network (1) and one or more interactions between some or all of the plurality of entities; obtain a plurality of alarms (121, 122, 123), wherein each alarm (121, 122, 123) is associated with at least one of the plurality of entities; correlate the plurality of alarms (121, 122, 123) into one or more groups of alarms (131, 132), wherein each group of alarms (131, 132) is associated with a subset of the data that includes a subset of the topology information; and identify one or more incidents (141) from the one or more groups of alarms (131, 132), based on an estimation of a root cause probability for each group of alarms (131, 132) according to its associated subset of the data.

2. The device (100) according to claim 1, further configured to: determine one or more subsets of the data based on the one or more groups of alarms (131, 132) and the obtained data (110).

3. The device (100) according to claim 1 or 2, further configured to: obtain hierarchical-structured data (310) based on the obtained data (110) according to one or more criteria, wherein the plurality of entities in the hierarchical-structured data (310) have hierarchical dependency relationships, wherein the hierarchical-structured data (310) comprises a plurality of links, and wherein each link represents one or more hierarchical dependency relationships between the plurality of entities of the communication network (1).

4. The device (100) according to claim 3, further configured to: obtain one or more subsets (410) of the hierarchical-structured data (310), based on the hierarchical-structured data (310) and the one or more groups of alarms (131, 132), wherein each subset (410) of the hierarchical-structured data (310) comprises a first entity (411) having at least one hierarchical dependency relationship to at least one other entity, a second entity (412) having no hierarchical dependency relationship to another entity, and a third entity (413) located between the first entity and the second entity.

5. The device (100) according to claim 4, further configured to: determine a number of alarms associated with each of the first entity (411), the second entity (412), and the third entity (413) of at least one subset of the hierarchical-structured data, based on the one or more groups of alarms (131, 132).

6. The device (100) according to any one of the claims 1 to 5, further configured to: obtain an additional alarm associated with at least one of the plurality of entities; and add the additional alarm to a group from the one or more groups of alarms or to a new group.

7. The device (100) according to claim 6, further configured to: adjust, when adding the additional alarm to the group from the one or more groups of alarms, the estimation of the root cause probability for that group.

8. The device (100) according to one of the claims 1 to 7, further configured to: correlate the plurality of alarms (121, 122, 123) into the one or more groups of alarms (131, 132) based further on temporal information (111) of the communication network (1).

9. The device (100) according to claim 8, further configured to: determine, based on the temporal information (111) of the communication network (1), one or more time intervals, at which the plurality of alarms (121, 122, 123) are generated.

10. The device (100) according to claim 9, when depending on claim 4, further configured to estimate, for each of the time intervals, an interval probability by: determining one or more possible paths connecting the second entity (412) and the first entity (411) in the subset (410) of the hierarchical-structured data (310); and determining alarms associated with entities arranged in the one or more possible paths.

11. The device (100) according to claim 10, when depending on claim 4, further configured to: estimate the root cause probability for each group (131, 132) associated to a subset (410) of the hierarchical-structured data (310) based on the determined interval probabilities of that subset (410) of the hierarchical-structured data (310).

12. The device (100) according to one of the claims 8 to 11, when depending on claim 4, further configured to: estimate, based on the temporal information of the communication network (1), a prior root cause probability for at least one entity in at least one subset (410) of the hierarchical-structured data (310); and determine, based on the estimated prior root cause probability, a root cause probability for the at least one entity in a current group to be a root cause of an incident.

13. The device (100) according to one of the claims 8 to 12, further configured to: determine, based on the temporal information of the communication network (1), a temporal weighting function for the one or more groups of alarms (131, 132); and apply the determined temporal weighting function to the estimation of the root cause probability of at least one group of alarms (131, 132).

14. A method (900) for monitoring a communication network (1), the method (900) comprising: obtaining (S901) data (110) including topology information (111), wherein the topology information (111) is indicative of a plurality of entities of the communication network (1) and one or more interactions between some or all of the plurality of entities; obtaining (S902) a plurality of alarms (121, 122, 123), wherein each alarm (121, 122, 123) is associated with at least one of the plurality of entities; correlating (S903) the plurality of alarms (121, 122, 123) into one or more groups of alarms (131, 132), wherein each group of alarms (131, 132) is associated with a subset of the data that includes a subset of the topology information; and identifying (S904) one or more incidents (141) from the one or more groups of alarms (131, 132), based on an estimation of a root cause probability for each group of alarms (131, 132) according to its associated subset of the data.

15. A computer program product comprising instructions, which, when executed by a computer, cause the method (900) of claim 14 to be performed.