WO2021249629A1 - Device and method for monitoring communication networks - Google Patents

Device and method for monitoring communication networks

Info

Publication number
WO2021249629A1
Authority
WO
WIPO (PCT)
Prior art keywords
alarms
entity
hierarchical
entities
subset
Prior art date
Application number
PCT/EP2020/065990
Other languages
French (fr)
Inventor
Cristian-Alexandru Olariu
MingXue Wang
Peng Hu
Hitham Ahmed Assem Aly SALAMA
Original Assignee
Huawei Technologies Co., Ltd.
Application filed by Huawei Technologies Co., Ltd.
Priority to PCT/EP2020/065990
Publication of WO2021249629A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/12 Discovery or management of network topologies

Definitions

  • the present disclosure relates generally to communications networks, and particularly to monitoring communication networks.
  • a device and a method for monitoring a communication network are disclosed.
  • the disclosed device and method may support performing a Root Cause Analysis (RCA), and/or identifying an incident or a root cause of a problem.
  • communication networks (e.g., telecommunication networks) are vulnerable to problems (such as faults and/or incidents) that may occur, for example, due to hardware or software configurations, or changes in the communication networks, etc.
  • Conventional devices and methods for identifying incidents or performing RCA are based on monitoring a performance and health of large-scale distributed heterogeneous computing systems at various locations (e.g., physical machine logs, software stack traces, etc.).
  • the monitoring process may require data (such as numerical data, textual data, etc.) from the whole system.
  • the collected data may be used to extract insights into the health state of the system, and this is achieved mainly through raising alarms when the system behaves differently than expected.
  • the number of alarms raised by such a system may fall under the category of big data. It is generally desirable to improve the process of finding an incident or root cause in a vast number of alarms generated at a big data scale.
  • embodiments of the present disclosure aim to improve conventional devices and methods for monitoring a communication network.
  • An objective is to provide a device and a method that can identify an incident or a root cause of a problem in the network.
  • Another objective is to provide a device and method that can obtain a dataset from the communication network and use it for efficiently identifying an incident or a root cause of a problem in the communication network.
  • Another objective is to provide a device and method that can provide, as an output, an identified incident or perform an RCA for a problem.
  • a first aspect of the present disclosure provides a device for monitoring a communication network, the device being configured to obtain data including topology information, wherein the topology information is indicative of a plurality of entities of the communication network and one or more interactions between some or all of the plurality of entities.
  • the device is further configured to obtain a plurality of alarms, wherein each alarm is associated with at least one of the plurality of entities, correlate the plurality of alarms into one or more groups of alarms, wherein each group of alarms is associated with a subset of the data that includes a subset of the topology information, and identify one or more incidents from the one or more groups of alarms, based on an estimation of a root cause probability for each group of alarms according to its associated subset of the data.
  • the estimation of the root cause probability of a given group of alarms represents a likelihood that an entity associated with an alarm in that given group of alarms is a root cause of an incident in the communication network.
  • the device may be, or may be incorporated in, an electronic device such as a computer, a personal computer (PC), a tablet, a laptop, a network entity, a server computer, a client device, etc.
  • the communication network may comprise the plurality of entities that may interact with each other.
  • the plurality of entities of the communication network may comprise any network entity, such as a physical entity or a logical entity, or a network node, or a network element of the communication network.
  • a physical entity may be a server, a router, or a switch in the communication network.
  • a logical entity may be a logically separate entity with a well-defined functionality in the communication network, like a network function.
  • the device may obtain the data including the topology information related to the plurality of entities and the interactions between entities.
  • the data may be obtained directly from the communication network, or it may be obtained indirectly from a monitoring system.
  • the plurality of alarms may be obtained by the device based on an alarming system flagging abnormal behaviors in the communication network, or by the monitoring system capturing information about the alarms and their associated entities in the communication network.
  • the device of the first aspect may perform a root cause analysis (e.g., a temporal graph-based root cause analysis) to detect an entity, which is the cause of the abnormal event that triggered a chain of alarms (e.g., included in the plurality of alarms) in the communication network.
  • the device may identify an incident (e.g., identify an entity responsible for a fault) by leveraging interaction events logged across the plurality of entities of the communication network.
  • the device may store the obtained data in a graph, and perform the RCA process on such graphs or data-embedding structures.
  • the device may include, in the RCA, entities that did not raise any alarm, but may act as dependency links between active problematic entities.
  • the device of the first aspect may address the problem faced by Site Reliability Engineers (SREs) when starting an investigation into the root cause of an incident. For example, the device may estimate root cause probabilities, in order to identify entities (e.g., network nodes of the communication network) that are more likely to be the root cause of an incident or the root cause of a problem.
  • the device is further configured to determine one or more subsets of the data based on the one or more groups of alarms and the obtained data.
  • the device is further configured to obtain hierarchical-structured data based on the obtained data according to one or more criteria, wherein the plurality of entities in the hierarchical-structured data have hierarchical dependency relationships, wherein the hierarchical-structured data comprises a plurality of links, and wherein each link represents one or more hierarchical dependency relationships between the plurality of entities of the communication network.
  • the device may obtain topology information that may indicate the entities (nodes, such as logical nodes or physical nodes) and their interactions. Such interactions are usually done via interfaces.
  • the device may obtain a graph, in which the connections are represented using directed edges, so as to capture the bidirectionality of the interactions.
  • the raised alarms (which may contain timestamps) may be associated directly or indirectly with one of the entities (nodes) in the graph.
  • the device may group the alarms using temporal (e.g., using the timestamps) and topological information.
  • the device is further configured to obtain one or more subsets of the hierarchical-structured data, based on the hierarchical-structured data and the one or more groups of alarms, wherein each subset of the hierarchical-structured data comprises a first entity having at least one hierarchical dependency relationship to at least one other entity, a second entity having no hierarchical dependency relationship to another entity, and a third entity located between the first entity and the second entity.
  • each subset of the hierarchical-structured data may be, for example, a sub-graph.
  • the groups of alarms may affect a subset of the graph, and the device may use that sub-graph to perform the estimation of the root cause probability.
  • the sub-graphs may be processed to extract roots, leaves and nodes in-between.
  • the device is further configured to determine a number of alarms associated with each of the first entity, the second entity, and the third entity of at least one subset of the hierarchical-structured data, based on the one or more groups of alarms.
  • each of the entities may be associated with a number of alarms.
  • for a given leaf entity, the sub-graph may be traversed from all root entities to that leaf entity, across all possible paths.
  • during the traversal, the device may accumulate the number of alarms, which can be considered as affected by that leaf entity. This process may be repeated for all leaf entities, and, at the end, the sums may be normalized and root cause probabilities may be extracted from the normalized sums.
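The traversal described above can be illustrated with a minimal Python sketch. It is not the claimed implementation: the function names `all_paths` and `leaf_root_cause_probabilities`, the dictionary-based graph layout, and the choice to count each entity once per leaf (rather than once per path) are assumptions made only for illustration.

```python
from typing import Dict, List, Optional, Set


def all_paths(graph: Dict[str, List[str]], src: str, dst: str,
              path: Optional[List[str]] = None) -> List[List[str]]:
    """Enumerate every simple path from src to dst along dependency links."""
    path = (path or []) + [src]
    if src == dst:
        return [path]
    found: List[List[str]] = []
    for nxt in graph.get(src, []):
        if nxt not in path:  # avoid revisiting an entity (cycle guard)
            found.extend(all_paths(graph, nxt, dst, path))
    return found


def leaf_root_cause_probabilities(graph: Dict[str, List[str]],
                                  alarm_counts: Dict[str, int]) -> Dict[str, float]:
    """Sum alarms on all root-to-leaf paths per leaf, then normalize the sums."""
    nodes: Set[str] = set(graph) | {d for deps in graph.values() for d in deps}
    has_incoming = {d for deps in graph.values() for d in deps}
    roots = [n for n in nodes if n not in has_incoming]
    leaves = [n for n in nodes if not graph.get(n)]

    sums: Dict[str, int] = {}
    for leaf in leaves:
        on_paths: Set[str] = set()
        for root in roots:
            for p in all_paths(graph, root, leaf):
                on_paths.update(p)
        # Assumption: an entity shared by several paths is counted once.
        sums[leaf] = sum(alarm_counts.get(n, 0) for n in on_paths)

    total = sum(sums.values())
    return {leaf: (s / total if total else 0.0) for leaf, s in sums.items()}


# Hypothetical sub-graph: A depends on B, B depends on C and D.
example_graph = {"A": ["B"], "B": ["C", "D"]}
example_counts = {"A": 4, "B": 6, "C": 2, "D": 6}
example_probs = leaf_root_cause_probabilities(example_graph, example_counts)
```

In this hypothetical example the accumulated sums are 12 for leaf C and 16 for leaf D, so the normalized values are 12/28 and 16/28 respectively.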
  • the device is further configured to obtain an additional alarm associated with at least one of the plurality of entities, and add the additional alarm to a group from the one or more groups of alarms or to a new group.
  • the device is further configured to adjust, when adding the additional alarm to the group from the one or more groups of alarms, the estimation of the root cause probability for that group.
  • the device may include information related to the temporality of incident build up and may further adjust the root cause probabilities with each newly added alarm to a group of alarms.
  • the device is further configured to correlate the plurality of alarms into the one or more groups of alarms based further on temporal information of the communication network.
  • the device is further configured to determine, based on the temporal information of the communication network, one or more time intervals, at which the plurality of alarms are generated.
  • the device is further configured to estimate, for each of the time intervals, an interval probability by determining one or more possible paths connecting the second entity and the first entity in the subset of the hierarchical-structured data, and determining alarms associated with entities arranged in the one or more possible paths.
  • the device is further configured to estimate the root cause probability for each group associated to a subset of the hierarchical-structured data based on the determined interval probabilities of that subset of the hierarchical-structured data.
  • the device is further configured to estimate, based on the temporal information of the communication network, a prior root cause probability for at least one entity in at least one subset of the hierarchical-structured data, and determine, based on the estimated prior root cause probability, a root cause probability for the at least one entity in a current group to be a root cause of an incident.
  • the device is further configured to determine, based on the temporal information of the communication network, a temporal weighting function for the one or more groups of alarms, and apply the determined temporal weighting function to the estimation of the root cause probability of at least one group of alarms.
  • a second aspect of the disclosure provides a method for monitoring a communication network, the method comprising obtaining data including topology information, wherein the topology information is indicative of a plurality of entities of the communication network and one or more interactions between some or all of the plurality of entities, obtaining a plurality of alarms, wherein each alarm is associated with at least one of the plurality of entities, correlating the plurality of alarms into one or more groups of alarms, wherein each group of alarms is associated with a subset of the data that includes a subset of the topology information, and identifying one or more incidents from the one or more groups of alarms, based on an estimation of a root cause probability for each group of alarms according to its associated subset of the data.
  • the estimation of the root cause probability of a given group of alarms represents a likelihood that an entity associated with an alarm in that given group of alarms is a root cause of an incident in the communication network.
  • the method further comprises determining one or more subsets of the data based on the one or more groups of alarms and the obtained data.
  • the method further comprises obtaining hierarchical-structured data based on the obtained data according to one or more criteria, wherein the plurality of entities in the hierarchical-structured data have hierarchical dependency relationships, wherein the hierarchical-structured data comprises a plurality of links, and wherein each link represents one or more hierarchical dependency relationships between the plurality of entities of the communication network.
  • the method further comprises obtaining one or more subsets of the hierarchical-structured data, based on the hierarchical-structured data and the one or more groups of alarms, wherein each subset of the hierarchical-structured data comprises a first entity having at least one hierarchical dependency relationship to at least one other entity, a second entity having no hierarchical dependency relationship to another entity, and a third entity located between the first entity and the second entity.
  • the method further comprises determining a number of alarms associated with each of the first entity, the second entity, and the third entity of at least one subset of the hierarchical-structured data, based on the one or more groups of alarms.
  • the method further comprises obtaining an additional alarm associated with at least one of the plurality of entities, and adding the additional alarm to a group from the one or more groups of alarms or to a new group.
  • the method further comprises adjusting, when adding the additional alarm to the group from the one or more groups of alarms, the estimation of the root cause probability for that group.
  • the method further comprises correlating the plurality of alarms into the one or more groups of alarms based further on temporal information of the communication network.
  • the method further comprises determining, based on the temporal information of the communication network, one or more time intervals, at which the plurality of alarms are generated.
  • the method further comprises estimating, for each of the time intervals, an interval probability by determining one or more possible paths connecting the second entity and the first entity in the subset of the hierarchical-structured data, and determining alarms associated with entities arranged in the one or more possible paths.
  • the method further comprises estimating the root cause probability for each group associated to a subset of the hierarchical-structured data based on the determined interval probabilities of that subset of the hierarchical-structured data.
  • the method further comprises estimating, based on the temporal information of the communication network, a prior root cause probability for at least one entity in at least one subset of the hierarchical-structured data, and determining, based on the estimated prior root cause probability, a root cause probability for the at least one entity in a current group to be a root cause of an incident.
  • the method further comprises determining, based on the temporal information of the communication network, a temporal weighting function for the one or more groups of alarms, and applying the determined temporal weighting function to the estimation of the root cause probability of at least one group of alarms.
  • a third aspect of the present disclosure provides a computer program comprising a program code for performing the method according to the second aspect or any of its implementation forms.
  • a fourth aspect of the present disclosure provides a non-transitory storage medium storing executable program code which, when executed by a processor, causes the method according to the second aspect or any of its implementation forms to be performed.
  • FIG. 1 depicts a schematic view of a device for monitoring a communication network, according to an embodiment of the disclosure
  • FIG. 2 depicts a schematic view of a flowchart of a procedure for identifying an incident based on an estimation of a root cause probability
  • FIG. 3 depicts a schematic view of a diagram illustrating an example of obtained hierarchical-structured data
  • FIG. 4 depicts a schematic view of a diagram illustrating an example of a subset of the hierarchical-structured data obtained based on a group of alarms
  • FIGS. 5A-5B depict schematic views of a numerical example used for obtaining the interval probabilities (FIG. 5A), and a subgraph (FIG. 5B)
  • FIGS. 6A-6B depict schematic views of diagrams illustrating connected leaf entities and a single entity and leaf connected
  • FIG. 7 depicts a schematic view of a diagram illustrating an example of applying a temporal weighting function to the estimation of the root cause probability of a group of alarms
  • FIG. 8 depicts a schematic view of a diagram illustrating the device identifying an incident in an incident management system.
  • FIG. 9 depicts a schematic view of a flowchart of a method for monitoring a communication network, according to an embodiment of the disclosure.
  • FIG. 1 depicts a schematic view of a device 100 for monitoring a communication network 1 according to an embodiment of the invention.
  • the device 100 may be, or may be incorporated in, an electronic device, for example, a computer, a laptop, a network entity, etc.
  • the device 100 is configured to obtain data 110 including topology information 111.
  • the topology information 111 is indicative of a plurality of entities of the communication network 1.
  • the plurality of entities of the communication network 1 may comprise physical entities and/or logical entities, without limiting the present disclosure in that regard.
  • the topology information 111 is further indicative of one or more interactions between some or all of the plurality of entities.
  • the interactions may comprise any interaction that occurs between the entities of the communication network 1.
  • the device 100 may obtain the data 110, and may further parse each interaction that occurred in the communication network 1.
  • the device 100 is further configured to obtain a plurality of alarms 121, 122, 123.
  • Each alarm 121, 122, 123 is associated with at least one of the plurality of entities.
  • an entity of the communication network may raise an alarm (e.g., the alarm may be raised when the communication network behaves differently than expected).
  • the raised alarm may be associated with the entity that raised the alarm.
  • a raised alarm may be associated with more than one entity.
  • a raised alarm may be associated, directly, with an entity that raised the alarm, and it may further be associated, indirectly, with an entity that did not raise the alarm.
  • the device 100 is further configured to correlate the plurality of alarms 121, 122, 123 into one or more groups of alarms 131, 132. Moreover, each group of alarms 131, 132 may be associated with a subset of the data that includes a subset of the topology information.
  • the device 100 is configured to identify one or more incidents 141 from the one or more groups of alarms 131, 132, based on an estimation of a root cause probability for each group of alarms 131, 132 according to its associated subset of the data.
  • the alarms 122 and 123 may be correlated to one group of alarms 132.
  • the alarm 122 may be associated with a subset of data from the obtained data 110 that is affected by the alarm 122.
  • the alarm 123 may be associated with a subset of data from the obtained data 110 that is affected by the alarm 123.
  • the subset of data including the topology information associated with the alarms 122 and 123 may be used for the estimation of the root cause probability of the group of alarms 132.
  • the estimation of the root cause probability of a given group of alarms 131, 132 represents a likelihood that an entity associated with an alarm in that given group of alarms is a root cause of an incident in the communication network 1.
  • the device 100 may be able to identify an incident from the one or more groups of alarms 131, 132.
  • the device 100 may optionally have a decision unit 140, which may identify the incident.
  • the device 100 may comprise a processing circuitry (not shown in FIG. 1) configured to perform, conduct or initiate the various operations of the device 100 described herein.
  • the processing circuitry may comprise hardware and software.
  • the hardware may comprise analog circuitry or digital circuitry, or both analog and digital circuitry.
  • the digital circuitry may comprise components such as application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), or multi-purpose processors.
  • the processing circuitry comprises one or more processors and a non-transitory memory connected to the one or more processors.
  • the non-transitory memory may carry executable program code which, when executed by the one or more processors, causes the device 100 to perform, conduct or initiate the operations or methods described herein.
  • FIG. 2 depicts a schematic view of a flowchart of a procedure 200 for identifying an incident 141 based on an estimation of a root cause probability.
  • each node of the communication network has a name (e.g., an identifier). Further, the interactions between the nodes are performed via interfaces that are referred to as edges, without limiting the present disclosure in that regard.
  • the device 100 obtains data 110 and parses interactions between the plurality of entities.
  • the nodes of the communication network 1 interact with each other.
  • the interactions may comprise sending or receiving a request in the communication network 1.
  • the device may parse each interaction event that happens in the communication network 1. Parsing the interactions may comprise detecting the nodes (e.g., deriving names of nodes involved in the interactions) that are the originator and the receiver of a request.
  • each request may be done via an interface, and there may be multiple interfaces between the same two nodes.
  • the interfaces may comprise any known interface, and may have a name (e.g., an identifier such as i1, i2, etc.).
  • the device 100 may use the names of the nodes and the names of the interfaces to build a topology of the communication network (e.g., at step S202).
  • the topology may be a graph-representation of the communication network, in which the interfaces (e.g., the names of the interfaces) are used as the edges, in order to represent the interactions between the nodes.
  • the device 100 obtains hierarchical-structured data 310.
  • the hierarchical-structured data 310 is graph-structured data that is referred to as G in FIG. 2.
  • G may be a master graph or a master topology that indicates the overall topology of the communication network 1.
  • the device 100 may initialize a procedure comprising producing an empty graph. Further, as new interaction events are processed during step S201, the device 100 adds the names of the nodes and the names of the interfaces as edges to the graph G. For example, the device 100 may build a representation of the topology of the communication network based on the interactions observed in the event logs across the monitored period. The representation of the topology may be static and may not consider the temporality aspects or the time information of the occurred interactions in the graph G.
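As an illustration of the graph-building step described above, the following minimal Python sketch starts from an empty graph and adds node names and interface names as interaction events are processed. The event layout (originator, receiver, interface) and the function name `build_master_graph` are assumptions for illustration; the disclosure does not prescribe a data format. Timestamps are deliberately ignored, matching the static representation described above.

```python
from typing import Dict, Iterable, Set, Tuple

# Assumed event layout: (originator name, receiver name, interface name).
Event = Tuple[str, str, str]
MasterGraph = Dict[str, Dict[str, Set[str]]]


def build_master_graph(events: Iterable[Event]) -> MasterGraph:
    """Build a static, directed multigraph G from parsed interaction events."""
    graph: MasterGraph = {}  # start from an empty graph
    for originator, receiver, interface in events:
        # One directed link per (originator, receiver) pair; several interfaces
        # between the same two nodes are kept as a set of edge names.
        graph.setdefault(originator, {}).setdefault(receiver, set()).add(interface)
        # Make sure the receiver is present as a node even without outgoing links.
        graph.setdefault(receiver, {})
    return graph


# Hypothetical event log: two interfaces from A to B and one from B back to A.
example_events = [("A", "B", "i1"), ("A", "B", "i2"), ("B", "A", "i1"), ("B", "C", "i3")]
example_master = build_master_graph(example_events)
```

Keeping the interface names in a set per directed node pair preserves multiple interfaces between the same two nodes without duplicating repeated events.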
  • for a first entity (e.g., a node A) and a second entity (e.g., a node B), a directional connection is shown as an arrow having a direction and being outbound from the first entity and being inbound to the second entity.
  • the device 100 obtains the hierarchical-structured data 310 that may be referred to as master graph or master topology graph G, which is a multi-directional graph representing the overall topology of the communication network.
  • An example of graph G is depicted in FIG. 3.
  • FIG. 3 is a schematic view of an example of obtained hierarchical-structured data 310.
  • the device 100 obtains hierarchical-structured data 310 based on the obtained data 110.
  • the hierarchical-structured data 310 shown in FIG. 3 is indicative of the overall topology of the communication network.
  • the device may perform a mapping procedure for obtaining the hierarchical-structured data 310.
  • an exemplary mapping procedure is discussed for obtaining the hierarchical-structured data that is a representation of the communication network.
  • the central node in FIG. 3 acts as a root entity for the whole communication network.
  • the plurality of entities of the communication network are based on a root entity, external entities, and internal entities.
  • An external entity is an entity that requests resources.
  • An internal entity is an entity that provides resources.
  • the root entity (the central node shown in FIG. 3) represents the main contact point for external entities that need access to resources managed by the internal entities, depicted as outbound arrows (radiating out) from the root node.
  • each arrow depicts a dependency: the arrow points out from the dependent entity to another entity, which fulfills any request on behalf of its parent entity (the entity that initiated a request for the resources).
  • Examples of such architectures are content-delivery networks, micro-service platforms, web-services, basic routing networks, etc.
  • the device 100 correlates the plurality of alarms 121, 122, 123 into groups of alarms 131, 132.
  • the plurality of alarms may be grouped based on temporal information (e.g., using timestamps) and topological information (e.g., the hierarchical-structured data 310 constructed from the interactions between the entities at step S202).
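The disclosure does not fix a particular grouping rule, so the following Python sketch only shows one possible way to combine temporal and topological information. The rule used here (an alarm joins a group if it arrives within a time window of the group's latest alarm and its entity is the same as, or directly linked to, an entity already in the group), as well as the names `correlate_alarms`, `neighbors` and `window`, are assumptions for illustration.

```python
from typing import Dict, Iterable, List, Set, Tuple

Alarm = Tuple[float, str]  # assumed layout: (timestamp, entity name)


def correlate_alarms(alarms: Iterable[Alarm],
                     neighbors: Dict[str, List[str]],
                     window: float) -> List[dict]:
    """Group alarms that are close in time and adjacent in the topology."""
    groups: List[dict] = []
    for ts, entity in sorted(alarms):
        for group in groups:
            recent = ts - group["last_ts"] <= window
            related = any(
                entity == e
                or entity in neighbors.get(e, ())
                or e in neighbors.get(entity, ())
                for e in group["entities"]
            )
            if recent and related:
                group["alarms"].append((ts, entity))
                group["entities"].add(entity)
                group["last_ts"] = ts
                break
        else:
            # No existing group fits: open a new group for this alarm.
            entities: Set[str] = {entity}
            groups.append({"alarms": [(ts, entity)], "entities": entities, "last_ts": ts})
    return groups


# Hypothetical input: A is linked to B; X is unrelated to both.
example_alarms = [(0, "A"), (5, "B"), (7, "X"), (100, "A")]
example_neighbors = {"A": ["B"], "B": [], "X": []}
example_groups = correlate_alarms(example_alarms, example_neighbors, window=10)
```

With this hypothetical input, the alarms on A and B at times 0 and 5 fall into one group, the alarm on X opens a second group, and the late alarm on A opens a third group because it falls outside the time window.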
  • Each group of alarms may affect a subset of the hierarchical-structured data 310 (i.e., a subset of the whole graph shown in FIG. 3), and the device 100 uses that affected subset of the hierarchical-structured data (hereinafter referred to as “the sub-graph”) to estimate the root cause probability for that group of alarms.
  • the group of alarms may be identified as an incident in the communication network 1.
  • the device 100 may further determine the root cause of the identified incident.
  • the device 100 builds a sub-graph (referred to as G’ in FIG. 2).
  • the sub-graph G’ is a subset of the hierarchical-structured data 310 that is affected by an alarm.
  • the step S204 may be invoked each time a new group of alarms is created. For example, each time that a new group of alarms is created at step S203, the device 100 invokes step S204 to build a sub-graph for that newly created group of alarms.
  • the device 100 may parse the alarms and may extract the timestamp and the entity (i.e., the name of the entities) from each alarm in the incoming stream of alarms.
  • the obtained alarms may indicate the names of the entities in which the alarms originated, or, alternatively, the device 100 may use natural text analytics to match the alarms’ context to one of the entities of the communication network (e.g., to one of the entities in the graph G, i.e., n_i ∈ G).
  • the matching entity may be appended to a list of matching entities, named L.
  • the device 100 may also store the number of times the same entity is matched to one of the alarms. Moreover, the device 100 may obtain (by parsing all alarms in the groups of alarms) a list of all entities and their corresponding alarm count, and timestamps of raising each alarm.
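The bookkeeping described above (the list L of matching entities, the per-entity alarm count, and the timestamps) can be sketched in Python as follows. The alarm layout (a dictionary with `timestamp` and `entity` keys), the function name `match_alarms`, and the simple exact-name matching are assumptions for illustration; text-analytics-based matching is not shown.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def match_alarms(alarms: Iterable[dict],
                 known_entities: Set[str]) -> Tuple[List[str], Counter, Dict[str, list]]:
    """Parse alarms and collect matched entities, alarm counts and timestamps."""
    matched: List[str] = []               # the list L of matching entities
    counts: Counter = Counter()           # how often each entity was matched
    timestamps: Dict[str, list] = defaultdict(list)
    for alarm in alarms:
        entity = alarm["entity"]
        if entity not in known_entities:
            # Assumption: names that match no entity of graph G are skipped.
            continue
        matched.append(entity)
        counts[entity] += 1
        timestamps[entity].append(alarm["timestamp"])
    return matched, counts, dict(timestamps)


# Hypothetical alarm stream; "unknown-node" is not part of graph G.
example_stream = [
    {"timestamp": 10, "entity": "tm-behavior"},
    {"timestamp": 12, "entity": "tm-search"},
    {"timestamp": 15, "entity": "tm-behavior"},
    {"timestamp": 16, "entity": "unknown-node"},
]
example_L, example_alarm_counts, example_times = match_alarms(
    example_stream, known_entities={"tm-behavior", "tm-search"}
)
```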
  • the groups of alarms 131, 132 may be updated upon the receipt of new alarms, and the step S204 may be triggered during each updating process.
  • FIG. 4 is a schematic view of an example of a subset of the hierarchical-structured data 410.
  • the subset of the hierarchical-structured data 410 shown in FIG. 4 is a subgraph of the hierarchical-structured data 310 and is obtained after correlating the alarms and obtaining the group of alarms.
  • the device 100 obtains the subset of the hierarchical-structured data 410, or the sub-graph shown in FIG. 4, by constructing a new graph, named G’, based on the list of matched entities to an alarm, and by using the hierarchical-structured data 310 (master topology graph G) to extract the links and their directions.
  • the resulting sub-graph G’ embeds the entities currently affected by the alarms, and the magnitude of the impact which is represented by the number of alarms.
  • “tm-behavior#4” represents four alarms that are associated with the entity having the name “tm-behavior”.
  • the sub-graph G’ also depicts the hierarchy of how the incident’s alarms propagated in the communication network.
  • a given cause propagates from the origins (leaf nodes 412) upstream, and affects other nodes which rely on the services of the leaf nodes (root nodes 411 and intermediary nodes 413).
  • the entity having the name of “tm-order#5” represents an extracted node name that does not match any node in the graph G, and has no contribution to the root cause analysis of the entities.
  • the subgraph shown in FIG. 4 is a subset of the hierarchical-structured data 410 that includes two first entities 411, referred to as “tm-behavior#4” (shown as a root entity having hierarchical dependency relationship to other entities) and “tm-order#5”.
  • the hierarchical-structured data 410 of FIG. 4 further includes two second entities 412 (each shown as a leaf having no hierarchical dependency relationship to another entity) referred to as “ESGE#2” and “tm-operation#6”, and one third entity 413 shown as an intermediary entity and referred to as “tm-search#6”.
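A minimal Python sketch of how a sub-graph G’ might be derived from the master graph G and then split into first (root), second (leaf) and third (intermediary) entities is given below. The entity names and alarm counts are taken from the FIG. 4 description, but the links between them are hypothetical, since the text does not list the edges of FIG. 4. The function names `build_subgraph` and `classify_entities`, and the assumption that every entity of G appears as a key of the `master` dictionary, are also illustrative only.

```python
from typing import Dict, Iterable, List


def build_subgraph(master: Dict[str, Iterable[str]],
                   alarm_counts: Dict[str, int]) -> Dict[str, List[str]]:
    """Keep only entities that raised alarms and exist in G, with their links."""
    affected = {e for e in alarm_counts if e in master}
    return {
        src: sorted(dst for dst in master[src] if dst in affected)
        for src in affected
    }


def classify_entities(subgraph: Dict[str, List[str]]) -> Dict[str, str]:
    """Label each entity of G' as root, leaf or intermediate."""
    has_parent = {dst for deps in subgraph.values() for dst in deps}
    roles: Dict[str, str] = {}
    for node, deps in subgraph.items():
        if not deps:
            roles[node] = "leaf"          # no dependency on another entity
        elif node not in has_parent:
            roles[node] = "root"          # nothing in G' depends on it
        else:
            roles[node] = "intermediate"  # lies between a root and a leaf
    return roles


# Hypothetical links; only the names and counts come from the FIG. 4 description.
example_G = {
    "tm-behavior": ["tm-search"],
    "tm-search": ["ESGE", "tm-operation"],
    "ESGE": [],
    "tm-operation": [],
    "tm-other": ["ESGE"],
}
example_fig4_counts = {"tm-behavior": 4, "tm-search": 6, "ESGE": 2,
                       "tm-operation": 6, "tm-order": 5}
example_sub = build_subgraph(example_G, example_fig4_counts)
example_roles = classify_entities(example_sub)
example_labels = {n: f"{n}#{example_fig4_counts[n]}" for n in example_sub}
```

In this sketch the name "tm-order" is dropped because it matches no node of the hypothetical master graph, in line with the statement that such a name does not contribute to the root cause analysis.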
  • the device 100 estimates, for each of the time intervals, an interval probability (also referred to as leaf probability in FIG. 2).
  • the device 100 may determine the time intervals at which the plurality of alarms 121, 122, 123 are generated. Moreover, for each of the time intervals, the device 100 may estimate its respective interval probability by determining the possible paths connecting the second entity 412 and the first entity 411 in the subset of the hierarchical-structured data 410. Furthermore, the device 100 may determine the number of alarms that are associated with the first entity 411, the second entity 412, and the third entity 413 in each possible path.
  • FIG. 5A and FIG. 5B are schematic views of a numerical example for obtaining the interval probabilities (shown in diagram 500A of FIG. 5A) for the subgraph (the subset of hierarchical-structured data 410) shown in FIG. 5B.
  • the computation illustrated in the diagram 500A of FIG. 5A may be performed by the device 100 at step S205 of the procedure 200 shown in FIG. 2.
  • the device 100 may repeat this process for each of the time intervals into which the group of alarms is split. That means, with each new alarm received by the communication network, the representation of the communication network in the hierarchical-structured data 310 or the subset of hierarchical-structured data 410 (the sub-graph) may be updated with the entity that is associated with the received alarm, and the following procedure may be repeated.
  • the sub-graph G’ is a subset of hierarchical-structured data 410 that comprises the entities 411, 412, 413, their multi-directional links, and the number of associated alarms.
  • the entity 411 in the sub-graph 410 has the name “tm-apigw”, and 138 alarms are associated with the entity 411.
  • the sub-graph G’ 410 shown in FIG. 5B may be parsed by the device 100, as follows:
  • the device 100 may determine, for each second entity 412 (leaf entity), all the possible paths between that leaf entity and all the first entities 411 (root entities) of the subgraph 410.
  • the device 100 may pass each entity in each possible path, and may accumulate the number of alarms that are associated with that passed entity.
  • the device 100 may add, only once, the number of alarms associated with the current second entity 412 (leaf node) for which the interval probability (leaf probability) is being calculated.
  • the result of the above process may yield a number of associated alarms for each second entity 412 (leaf entity). This sum is proportional to the impact of each leaf entity on the overall incident.
  • the entity referred to as “tm-odp#7” is affected by 76 alarms and the entity referred to as “tm-recommend#18” is affected by 302 alarms.
  • each leaf entity’s sum can be normalized by the grand total sum across all leaf entities, and that may yield a probability estimation for the root cause probability of the entity and/or the root cause probability of the group of alarms.
  • the root cause probability of the entity referred to as “tm-odp#7” is 0.2.
  • the root cause probability of the entity referred to as “tm-recommend#18” is 0.8.
  • the device 100 may further perform a procedure for an augmentation of missing entities.
  • the device 100 may use the dynamic topology of historical interactions (captured in the hierarchical-structured data or G) to augment the incident graph (the subset of hierarchical-structured data or G’), which is a sub-graph of the whole entity graph G. For example, the device 100 may search the larger graph to determine connections between leaves in G’ that include nodes that did not raise an alarm.
  • An example is shown in diagram 600A of FIG. 6A and diagram 600B of FIG. 6B, illustrating connected leaf entities (FIG. 6A) and a single entity and leaf connected (FIG. 6B).
  • the entity referred to as “tm-search#2” 611 has dependency connections to the entity referred to as “tm-order#1” 612.
  • any alarm raised by the entity “tm-order#1” 612 may potentially affect the entity “tm-search#2” 611. Therefore, the “tm-order#1” entity’s affected alarm count will include paths from the root to it that traverse the “tm-search#2” entity 611 as well.
  • the entity referred to as “tm-order#1” 612 is suggested as the root cause of an incident.
  • the entity referred to as “tm-search#6” 611 has no dependency connections to the entity referred to as “tm-order#5” 612. In this case, the connection discussed with respect to FIG. 6A cannot be considered. Therefore, without considering the connection from the “tm-search#6” entity to the “tm-topic#0” entity and further to the “tm-order#1” entity, the entity referred to as “tm-search#6” may be identified as the root cause of the incident.
  • the device 100 may determine a temporal weighting function, and further apply the determined temporal weighting function to the estimation of the root cause probability of a group of alarms.
  • alarms may be generated at different times.
  • the device 100 when correlating the plurality of alarms, for obtaining groups of alarms, may add these alarms either to an existing incident, or to a new incident.
  • the root cause probabilities may further be adjusted.
  • the device 100 may further obtain a state transition matrix (S.T.M) comprising the states of the generated incidents in the communication network 1.
  • the device 100 may compute the root cause probabilities, and may further compute a timeline of probabilities.
  • the device 100 may apply the temporal weighting function to this timeline of probabilities.
  • the device 100 may learn (e.g., it may derive) the temporal weighting function from historical prior interactions (e.g., typical time information between raised alarms, typical duration time between the first and the last alarm in an incident, etc.).
  • an exponential weighting function may be used; more generally, however, the device 100 may use any weighting function.
  • FIG. 7 shows a diagram 700 illustrating an example of applying a temporal weighting function to the estimation of the root cause probability of a group of alarms.
  • the device 100 recomputes (estimates again) the root cause probabilities (the estimated probabilities indicated with “Without weighting”), which are further weighted according to the time at which they were generated (the estimated probabilities indicated with “With weighting”).
  • the state of entities may evolve over time, and consequently, the root cause probability of each entity may also change with each new state or a new piece of information.
  • the device 100 estimates the root cause probability for each group of alarms 131, 132 and further identifies the incident 141.
  • the incident 141 is a group of alarms, for which the estimated root cause probability has the highest value.
  • the device 100 may use priority information related to an entity for estimating the root cause probability for that entity, in a current group of alarms or in a current state, to be the root cause of an incident.
  • the priority information may be an estimated prior root cause probability.
  • the obtained data 110 can be collected over time about the root cause probability that an entity is the root cause of an incident. This information may also be collected from a domain expert.
  • the device 100 may augment the root cause probability of an entity based on the current state of the communication network 1 with the a-priori root cause probabilities using a probability boosting function. In some embodiments, the device 100 may further train the probability boosting function, when a ground truth is collected about past incidents.
  • An example for the independent knowledge case is when the rate of failure of an entity is captured across all failures encountered.
  • an example for the inter-dependent knowledge case is when knowledge is captured about the probability of an entity being the root cause, given that it is grouped with certain other nodes in the same incident.
  • FIG. 8 is a schematic view of a diagram illustrating the device 100 identifying an incident in an incident management system.
  • the device 100 is, as an example, integrated within a communication network that is an incident management system.
  • the diagram of FIG. 8 represents the integration of the device 100 into the incident management system and the visualization of the information made available by the device 100 to a Site Reliability Engineer, who may use its output.
  • the device 100 obtains data 110.
  • the data 110 may comprise incident lists, time range intervals at which an incident is active, and a number of alarms associated with the incidents.
  • the device 100 upon selection of one or more alarms, obtains information related to the alarms, including the entities associated to the alarms. This information may be represented in a view 810.
  • the device 100 may obtain sub-graph G’ 410.
  • the sub-graph G’ 410 may be obtained based on the alarms of the selected incident(s).
  • the sub-graph G’ 410 further shows the number of alarms associated with the entities, and, in case of the leaf node, the root cause probability for that leaf to be the root cause of the incident.
  • the device 100 may further identify an incident 141.
  • As an example, for the sake of simplicity, a bar graph is presented that shows the entities sorted by their respective probability of being the root cause of the incident. This exemplary view enhances the human understanding of the scale of the difference between the different proposed root cause entities.
  • FIG. 9 shows a method 900 for monitoring a communication network according to an embodiment of the disclosure.
  • the method 900 may be carried out by the device 100, as it is described above.
  • the method 900 comprises a step S901 of obtaining data 110 including topology information 111, wherein the topology information 111 is indicative of a plurality of entities of the communication network 1 and one or more interactions between some or all of the plurality of entities.
  • the method 900 further comprises a step S902 of obtaining a plurality of alarms 121, 122, 123, wherein each alarm 121, 122, 123 is associated with at least one of the plurality of entities.
  • the method 900 further comprises a step S903 of correlating the plurality of alarms 121, 122, 123 into one or more groups of alarms 131, 132, wherein each group of alarms 131, 132 is associated with a subset of the data that includes a subset of the topology information.
  • the method 900 further comprises a step S904 of identifying one or more incidents 141 from the one or more groups of alarms 131, 132, based on an estimation of a root cause probability for each group of alarms 131, 132 according to its associated subset of the data.
  • the estimation of the root cause probability of a given group of alarms 131, 132 represents a likelihood that an entity associated with an alarm in that given group of alarms is a root cause of an incident in the communication network 1.

Abstract

The present disclosure relates to a device for monitoring a communication network. The device obtains data including topology information, wherein the topology information is indicative of a plurality of entities of the communication network and one or more interactions between some or all of the plurality of entities. The device further obtains a plurality of alarms, wherein each alarm is associated with at least one of the plurality of entities, and correlates the plurality of alarms into one or more groups of alarms, wherein each group of alarms is associated with a subset of the data that includes a subset of the topology information. Moreover, the device identifies one or more incidents from the one or more groups of alarms, based on an estimation of a root cause probability for each group of alarms according to its associated subset of the data.

Description

DEVICE AND METHOD FOR MONITORING COMMUNICATION NETWORKS
TECHNICAL FIELD
The present disclosure relates generally to communications networks, and particularly to monitoring communication networks. To this end, a device and a method for monitoring a communication network are disclosed. For example, the disclosed device and method may support performing a Root Cause Analysis (RCA), and/or identifying an incident or a root cause of a problem.
BACKGROUND
Generally, communication networks (e.g., telecommunication networks) include many components running in a complex environment. Moreover, communication networks are vulnerable to problems (such as faults and/or incidents) that may occur, for example, due to hardware or software configurations, or changes in the communication networks, etc.
Conventional devices and methods for identifying incidents or performing RCA are based on monitoring a performance and health of large-scale distributed heterogeneous computing systems at various locations (e.g., physical machine logs, software stack traces, etc.). The monitoring process may require data (such as numerical data, textual data, etc.) from the whole system. Further, the collected data may be used to extract insights into the health state of the system, and this is achieved mainly through raising alarms when the system behaves differently than expected.
The number of alarms raised by such a system may fall under the category of big data. It is generally desirable to improve the process of finding an incident or root cause in a vast amount of alarms generated at a big data scale.
SUMMARY
In view of the above-mentioned problems and disadvantages, embodiments of the present disclosure aim to improve conventional devices and methods for monitoring a communication network. An objective is to provide a device and a method that can identify an incident or a root cause of a problem in the network. Another objective is to provide a device and method that can obtain a dataset from the communication network and use it for efficiently identifying an incident or a root cause of a problem in the communication network. Another objective is to provide a device and method that can provide, as an output, an identified incident or perform an RCA for a problem.
The above mentioned one or more objectives are achieved by the embodiments of the disclosure as described in the enclosed independent claims. Advantageous implementations of the embodiments of the disclosure are further defined in the dependent claims.
A first aspect of the present disclosure provides a device for monitoring a communication network, the device being configured to obtain data including topology information, wherein the topology information is indicative of a plurality of entities of the communication network and one or more interactions between some or all of the plurality of entities. The device is further configured to obtain a plurality of alarms, wherein each alarm is associated with at least one of the plurality of entities, correlate the plurality of alarms into one or more groups of alarms, wherein each group of alarms is associated with a subset of the data that includes a subset of the topology information, and identify one or more incidents from the one or more groups of alarms, based on an estimation of a root cause probability for each group of alarms according to its associated subset of the data.
For example, the estimation of the root cause probability of a given group of alarms represents a likelihood that an entity associated with an alarm in that given group of alarms is a root cause of an incident in the communication network.
The device may be, or may be incorporated in, an electronic device such as a computer, a personal computer (PC), a tablet, a laptop, a network entity, a server computer, a client device, etc.
The communication network may comprise the plurality of entities that may interact with each other. The plurality of entities of the communication network may comprise any network entity, such as a physical entity or a logical entity, or a network node, or a network element of the communication network. For instance, a physical entity may be a server, a router, or a switch in the communication network. A logical entity may be a logically separate entity with a well-defined functionality in the communication network, like a network function. Moreover, the device may obtain the data including the topology information related to the plurality of entities and the interactions between entities. For example, the data may be obtained directly from the communication network, or it may be obtained indirectly from a monitoring system. Furthermore, for example, the plurality of alarms may be obtained by the device based on an alarming system flagging abnormal behaviors in the communication network, or by the monitoring system capturing information about the alarms and their associated entities in the communication network.
The device of the first aspect may perform a root cause analysis (e.g., a temporal graph-based root cause analysis) to detect an entity, which is the cause of the abnormal event that triggered a chain of alarms (e.g., included in the plurality of alarms) in the communication network.
In some embodiments, the device may identify an incident (e.g., identify an entity responsible for a fault) by leveraging interaction events logged across the plurality of entities of the communication network. In some embodiments, the device may store the obtained data in a graph, and perform the RCA process on such data-embedding structures (i.e., the graphs).
In some embodiments, the device may include, in the RCA, entities that did not raise any alarm, but may act as dependency links between active problematic entities.
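As an illustration of how entities that raised no alarm could be pulled into the analysis, the following minimal Python sketch searches a full topology graph for directed paths between the entities of an incident graph that have no outgoing dependency links, and adds the connecting nodes and links. The function names (`shortest_path`, `augment`), the adjacency-map representation, and the use of breadth-first search are assumptions made for this example; the disclosure does not prescribe a particular search strategy.

```python
from collections import deque


def shortest_path(graph, src, dst):
    """Breadth-first search over directed edges; returns a node list or None."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nxt in graph.get(node, ()):
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    return None


def augment(master, incident):
    """Return a copy of `incident` extended with master-graph paths that
    connect its entities without outgoing links (hypothetical helper)."""
    out = {node: set(targets) for node, targets in incident.items()}
    ends = [node for node, targets in incident.items() if not targets]
    for a in ends:
        for b in ends:
            if a == b:
                continue
            path = shortest_path(master, a, b)
            if path:
                for u, v in zip(path, path[1:]):
                    out.setdefault(u, set()).add(v)
                    out.setdefault(v, set())
    return out


# Hypothetical topology: tm-search depends on tm-topic, which depends on tm-order.
master = {"tm-search": {"tm-topic"}, "tm-topic": {"tm-order"}, "tm-order": set()}
incident = {"tm-search": set(), "tm-order": set()}  # tm-topic raised no alarm
augmented = augment(master, incident)
```

In this example the silent entity `tm-topic` is added as a dependency link, so that only `tm-order` remains without outgoing links in the augmented graph.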
The device of the first aspect may address the problem faced by Site Reliability Engineers (SREs) when starting an investigation into the root cause of an incident. For example, the device may estimate root cause probabilities, in order to identify entities (e.g., network nodes of the communication network) that are more likely to be the root cause of an incident or the root cause of a problem.
In the following, the terms “entity” and “node” are used interchangeably, without limiting the present disclosure.
In an implementation form of the first aspect, the device is further configured to determine one or more subsets of the data based on the one or more groups of alarms and the obtained data.
In a further implementation form of the first aspect, the device is further configured to obtain hierarchical-structured data based on the obtained data according to one or more criteria, wherein the plurality of entities in the hierarchical-structured data have hierarchical dependency relationships, wherein the hierarchical-structured data comprises a plurality of links, and wherein each link represents one or more hierarchical dependency relationships between the plurality of entities of the communication network.
For example, the device may obtain topology information that may indicate the entities (nodes, such as logical nodes or physical nodes) and their interactions. Such interactions usually take place via interfaces. As an example of the hierarchical-structured data, the device may obtain a graph, in which the connections are represented using directed edges, so as to capture the bidirectionality of the interactions. Further, the raised alarms (which may contain timestamps) may be associated directly or indirectly with one of the entities (nodes) in the graph. Moreover, the device may group the alarms using temporal (e.g., using the timestamps) and topological information.
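The following Python sketch illustrates one way such a directed graph could be built from logged interactions, and how alarms could be grouped using both timestamps and topology. The function names, the `window` parameter, and the greedy grouping rule (an alarm joins a group if it is close in time to the group's latest alarm and its entity is identical or adjacent to an entity already in the group) are assumptions made for illustration only; the disclosure does not fix a particular correlation rule.

```python
def build_graph(interactions):
    """Directed adjacency map built from (source, target) interaction pairs."""
    graph = {}
    for src, dst in interactions:
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())
    return graph


def group_alarms(alarms, graph, window=60.0):
    """Greedily group (timestamp, entity) alarms by time and topology."""

    def adjacent(a, b):
        # Same entity, or linked in either direction in the graph.
        return a == b or b in graph.get(a, ()) or a in graph.get(b, ())

    groups = []
    for ts, entity in sorted(alarms):
        for group in groups:
            close_in_time = ts - group[-1][0] <= window
            if close_in_time and any(adjacent(entity, e) for _, e in group):
                group.append((ts, entity))
                break
        else:
            # No matching group: the alarm starts a new one.
            groups.append([(ts, entity)])
    return groups


# Hypothetical interactions and alarms (timestamps in seconds).
graph = build_graph([("tm-apigw", "tm-search"), ("tm-search", "tm-order")])
alarms = [(0, "tm-order"), (10, "tm-search"), (500, "tm-apigw"), (5, "tm-other")]
groups = group_alarms(alarms, graph)
```

Here the alarms on `tm-order` and `tm-search` fall into one group, while the unrelated entity and the late alarm each start a group of their own.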
In a further implementation form of the first aspect, the device is further configured to obtain one or more subsets of the hierarchical-structured data, based on the hierarchical-structured data and the one or more groups of alarms, wherein each subset of the hierarchical-structured data comprises a first entity having at least one hierarchical dependency relationship to at least one other entity, a second entity having no hierarchical dependency relationship to another entity, and a third entity located between the first entity and the second entity.
Each subset of the hierarchical-structured data may be, for example, a sub-graph.
In some embodiments, the groups of alarms may affect a subset of the graph, and the device may use that sub-graph to perform the estimation of the root cause probability. For example, the sub-graphs may be processed to extract roots, leaves and nodes in-between.
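A minimal Python sketch of the extraction of roots, leaves, and nodes in-between is given below. It assumes that the sub-graph is stored as an adjacency map whose edges point from a dependent entity to the entity it relies on, that a root is a node without incoming edges, and that a leaf is a node with incoming but no outgoing edges; these conventions, the function name `classify`, and the example edges are assumptions of this sketch.

```python
def classify(subgraph):
    """Split the nodes of a directed sub-graph into roots, leaves, and the rest."""
    has_incoming = {dst for targets in subgraph.values() for dst in targets}
    # Roots: nothing depends on them (isolated nodes are also treated as roots).
    roots = {n for n in subgraph if n not in has_incoming}
    # Leaves: they depend on nothing else, but something depends on them.
    leaves = {n for n, targets in subgraph.items() if not targets and n in has_incoming}
    middle = set(subgraph) - roots - leaves
    return roots, leaves, middle


# Hypothetical edges between example entity names.
subgraph = {
    "tm-behavior": {"tm-search"},
    "tm-search": {"ESGE", "tm-operation"},
    "ESGE": set(),
    "tm-operation": set(),
    "tm-order": set(),  # isolated node, e.g. a name not matched in the graph
}
roots, leaves, middle = classify(subgraph)
```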
In the following, when referring to the hierarchical-structured data or graphs or sub-graphs, the following terms are used interchangeably, without limiting the present disclosure:
• “root entity” and “first entity”
• “leaf entity” and “second entity”
• “intermediary entity” and “third entity” and
• “leaf probability” and “interval probability”.
In a further implementation form of the first aspect, the device is further configured to determine a number of alarms associated with each of the first entity, the second entity, and the third entity of at least one subset of the hierarchical-structured data, based on the one or more groups of alarms.
For example, each of the entities may be associated with a number of alarms. Further, for each leaf entity, the sub-graph may be traversed from all root entities to that leaf entity, across all possible paths. Moreover, the device may accumulate the number of alarms, which can be considered as affected by that leaf entity. This process may be repeated for all leaf entities, and, at the end, the sums are normalized and root cause probabilities may be extracted from the normalized sums.
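The traversal and normalization described above can be sketched in Python as follows. This is one possible reading under stated assumptions: alarm counts of the non-leaf entities are summed along every simple path from each root to the leaf (so an entity shared by several paths is counted once per path), the leaf's own count is added once, and the per-leaf sums are divided by their grand total. The function names, the adjacency-map representation, and the example counts are hypothetical.

```python
def all_paths(graph, start, goal, path=()):
    """Yield every simple directed path from `start` to `goal`."""
    path = path + (start,)
    if start == goal:
        yield path
        return
    for nxt in graph.get(start, ()):
        if nxt not in path:  # avoid cycles
            yield from all_paths(graph, nxt, goal, path)


def leaf_probabilities(graph, alarm_counts):
    """Estimate, per leaf, a normalized share of the accumulated alarms."""
    has_incoming = {dst for targets in graph.values() for dst in targets}
    roots = [n for n in graph if n not in has_incoming]
    leaves = [n for n in graph if not graph[n] and n in has_incoming]
    sums = {}
    for leaf in leaves:
        total = alarm_counts.get(leaf, 0)  # the leaf itself is counted once
        for root in roots:
            for path in all_paths(graph, root, leaf):
                total += sum(alarm_counts.get(n, 0) for n in path[:-1])
        sums[leaf] = total
    grand_total = sum(sums.values())
    if not grand_total:
        return {}
    return {leaf: s / grand_total for leaf, s in sums.items()}


# Hypothetical sub-graph and alarm counts.
graph = {"root": {"mid", "leafB"}, "mid": {"leafA"}, "leafA": set(), "leafB": set()}
counts = {"root": 10, "mid": 5, "leafA": 3, "leafB": 2}
probs = leaf_probabilities(graph, counts)
```

In this example `leafA` accumulates 3 + (10 + 5) = 18 alarms and `leafB` accumulates 2 + 10 = 12, which normalize to 0.6 and 0.4.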
In a further implementation form of the first aspect, the device is further configured to obtain an additional alarm associated with at least one of the plurality of entities, and add the additional alarm to a group from the one or more groups of alarms or to a new group.
In a further implementation form of the first aspect, the device is further configured to adjust, when adding the additional alarm to the group from the one or more groups of alarms, the estimation of the root cause probability for that group.
For example, the device may include information related to the temporality of incident build up and may further adjust the root cause probabilities with each newly added alarm to a group of alarms.
In a further implementation form of the first aspect, the device is further configured to correlate the plurality of alarms into the one or more groups of alarms based further on temporal information of the communication network.
In a further implementation form of the first aspect, the device is further configured to determine, based on the temporal information of the communication network, one or more time intervals, at which the plurality of alarms are generated.
In a further implementation form of the first aspect, the device is further configured to estimate, for each of the time intervals, an interval probability by determining one or more possible paths connecting the second entity and the first entity in the subset of the hierarchical-structured data, and determining alarms associated with entities arranged in the one or more possible paths.
In a further implementation form of the first aspect, the device is further configured to estimate the root cause probability for each group associated to a subset of the hierarchical-structured data based on the determined interval probabilities of that subset of the hierarchical-structured data.
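To make the role of the time intervals concrete, the following Python sketch splits the alarms of one group into fixed-length intervals and records the cumulative alarm count per entity at the end of each interval; each snapshot could then serve as input to an interval probability estimation. The fixed interval length, the function name, and the cumulative (rather than per-interval) counting are assumptions of this sketch.

```python
def cumulative_counts_by_interval(alarms, interval=60.0):
    """Return [(interval_end, {entity: cumulative alarm count}), ...]."""
    alarms = sorted(alarms)
    if not alarms:
        return []
    counts = {}
    snapshots = []
    boundary = alarms[0][0] + interval
    for ts, entity in alarms:
        # Close every interval that ended before this alarm was raised.
        while ts >= boundary:
            snapshots.append((boundary, dict(counts)))
            boundary += interval
        counts[entity] = counts.get(entity, 0) + 1
    snapshots.append((boundary, dict(counts)))
    return snapshots


# Hypothetical (timestamp in seconds, entity) alarms of one group.
alarms = [(0, "a"), (30, "b"), (70, "a"), (200, "b")]
snapshots = cumulative_counts_by_interval(alarms, interval=60)
```

Intervals in which no alarm arrives simply repeat the previous snapshot, so the resulting timeline has one entry per interval.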
In a further implementation form of the first aspect, the device is further configured to estimate, based on the temporal information of the communication network, a prior root cause probability for at least one entity in at least one subset of the hierarchical-structured data, and determine, based on the estimated prior root cause probability, a root cause probability for the at least one entity in a current group to be a root cause of an incident.
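One simple way to combine a prior root cause probability with the probability estimated for the current group is a multiplicative update followed by renormalization, sketched below in Python. This particular combination rule, the function name `boost`, and the `default_prior` parameter are assumptions chosen for illustration; the disclosure leaves the exact form of the combination open.

```python
def boost(current, prior, default_prior=1.0):
    """Scale current probabilities by per-entity priors and renormalize."""
    raw = {e: p * prior.get(e, default_prior) for e, p in current.items()}
    total = sum(raw.values())
    if not total:
        return dict(current)  # nothing to rescale; keep the current estimate
    return {e: v / total for e, v in raw.items()}


# Hypothetical values: both entities look equally likely from the current
# group alone, but entity "a" has been the root cause far more often in the past.
current = {"a": 0.5, "b": 0.5}
prior = {"a": 0.9, "b": 0.1}
boosted = boost(current, prior)
```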
In a further implementation form of the first aspect, the device is further configured to determine, based on the temporal information of the communication network, a temporal weighting function for the one or more groups of alarms, and apply the determined temporal weighting function to the estimation of the root cause probability of at least one group of alarms.
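A temporal weighting of this kind can be sketched in Python as follows, assuming an exponential weighting function applied to a timeline of per-entity probabilities. The choice to weight recent estimates more strongly than older ones, the `decay` parameter, and the final renormalization are assumptions of this sketch; as noted for the weighting function in general, other functions could be used, and in practice the parameter might be derived from historical data.

```python
import math


def weighted_probabilities(timeline, decay=0.01):
    """Combine a timeline [(timestamp, {entity: prob}), ...] into one
    distribution, weighting each entry by exp(-decay * age)."""
    if not timeline:
        return {}
    t_ref = max(t for t, _ in timeline)  # most recent estimate has age 0
    scores = {}
    for t, probs in timeline:
        weight = math.exp(-decay * (t_ref - t))
        for entity, p in probs.items():
            scores[entity] = scores.get(entity, 0.0) + weight * p
    total = sum(scores.values())
    if not total:
        return {}
    return {entity: s / total for entity, s in scores.items()}


# Hypothetical timeline: the decay is chosen so that an estimate that is
# 100 seconds old receives half the weight of the newest one.
timeline = [(0, {"a": 1.0}), (100, {"b": 1.0})]
result = weighted_probabilities(timeline, decay=math.log(2) / 100)
```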
A second aspect of the disclosure provides a method for monitoring a communication network, the method comprising obtaining data including topology information, wherein the topology information is indicative of a plurality of entities of the communication network and one or more interactions between some or all of the plurality of entities, obtaining a plurality of alarms, wherein each alarm is associated with at least one of the plurality of entities, correlating the plurality of alarms into one or more groups of alarms, wherein each group of alarms is associated with a subset of the data that includes a subset of the topology information, and identifying one or more incidents from the one or more groups of alarms, based on an estimation of a root cause probability for each group of alarms according to its associated subset of the data. For example, the estimation of the root cause probability of a given group of alarms represents a likelihood that an entity associated with an alarm in that given group of alarms is a root cause of an incident in the communication network.
In an implementation form of the second aspect, the method further comprises determining one or more subsets of the data based on the one or more groups of alarms and the obtained data.
In a further implementation form of the second aspect, the method further comprises obtaining hierarchical-structured data based on the obtained data according to one or more criteria, wherein the plurality of entities in the hierarchical-structured data have hierarchical dependency relationships, wherein the hierarchical-structured data comprises a plurality of links, and wherein each link represents one or more hierarchical dependency relationships between the plurality of entities of the communication network.
In a further implementation form of the second aspect, the method further comprises obtaining one or more subsets of the hierarchical-structured data, based on the hierarchical-structured data and the one or more groups of alarms, wherein each subset of the hierarchical-structured data comprises a first entity having at least one hierarchical dependency relationship to at least one other entity, a second entity having no hierarchical dependency relationship to another entity, and a third entity located between the first entity and the second entity.
In a further implementation form of the second aspect, the method further comprises determining a number of alarms associated with each of the first entity, the second entity, and the third entity of at least one subset of the hierarchical-structured data, based on the one or more groups of alarms.
In a further implementation form of the second aspect, the method further comprises obtaining an additional alarm associated with at least one of the plurality of entities, and adding the additional alarm to a group from the one or more groups of alarms or to a new group.
In a further implementation form of the second aspect, the method further comprises adjusting, when adding the additional alarm to the group from the one or more groups of alarms, the estimation of the root cause probability for that group.
In a further implementation form of the second aspect, the method further comprises correlating the plurality of alarms into the one or more groups of alarms based further on temporal information of the communication network.
In a further implementation form of the second aspect, the method further comprises determining, based on the temporal information of the communication network, one or more time intervals, at which the plurality of alarms are generated.
In a further implementation form of the second aspect, the method further comprises estimating, for each of the time intervals, an interval probability by determining one or more possible paths connecting the second entity and the first entity in the subset of the hierarchical-structured data, and determining alarms associated with entities arranged in the one or more possible paths.
In a further implementation form of the second aspect, the method further comprises estimating the root cause probability for each group associated to a subset of the hierarchical-structured data based on the determined interval probabilities of that subset of the hierarchical-structured data.
In a further implementation form of the second aspect, the method further comprises estimating, based on the temporal information of the communication network, a prior root cause probability for at least one entity in at least one subset of the hierarchical-structured data, and determining, based on the estimated prior root cause probability, a root cause probability for the at least one entity in a current group to be a root cause of an incident.
In a further implementation form of the second aspect, the method further comprises determining, based on the temporal information of the communication network, a temporal weighting function for the one or more groups of alarms, and applying the determined temporal weighting function to the estimation of the root cause probability of at least one group of alarms.
A third aspect of the present disclosure provides a computer program comprising a program code for performing the method according to the second aspect or any of its implementation forms.
A fourth aspect of the present disclosure provides a non-transitory storage medium storing executable program code which, when executed by a processor, causes the method according to the second aspect or any of its implementation forms to be performed.
It has to be noted that all devices, elements, units and means described in the present application could be implemented in the software or hardware elements or any kind of combination thereof. All steps which are performed by the various entities described in the present application as well as the functionalities described to be performed by the various entities are intended to mean that the respective entity is adapted to or configured to perform the respective steps and functionalities. Even if, in the following description of specific embodiments, a specific functionality or step to be performed by external entities is not reflected in the description of a specific detailed element of that entity which performs that specific step or functionality, it should be clear for a skilled person that these methods and functionalities can be implemented in respective software or hardware elements, or any kind of combination thereof.
BRIEF DESCRIPTION OF DRAWINGS
The above described aspects and implementation forms will be explained in the following description of specific embodiments in relation to the enclosed drawings, in which
FIG. 1 depicts a schematic view of a device for monitoring a communication network, according to an embodiment of the disclosure;
FIG. 2 depicts a schematic view of a flowchart of a procedure for identifying an incident based on an estimation of a root cause probability;
FIG. 3 depicts a schematic view of a diagram illustrating an example of obtained hierarchical-structured data;
FIG. 4 depicts a schematic view of a diagram illustrating an example of a subset of the hierarchical-structured data obtained based on a group of alarms;
FIG. 5A-5B depict schematic views of a numerical example used for obtaining the interval probabilities (FIG. 5A), and a subgraph (FIG. 5B);
FIG. 6A-6B depict schematic views of diagrams illustrating connected leaf entities and a single entity and leaf connected;
FIG. 7 depicts a schematic view of a diagram illustrating an example of applying a temporal weighting function to the estimation of the root cause probability of a group of alarms;
FIG. 8 depicts a schematic view of a diagram illustrating the device identifying an incident in an incident management system; and
FIG. 9 depicts a schematic view of a flowchart of a method for monitoring a communication network, according to an embodiment of the disclosure.
DETAILED DESCRIPTION OF EMBODIMENTS
FIG. 1 depicts a schematic view of a device 100 for monitoring a communication network 1 according to an embodiment of the invention.
The device 100 may be, or may be incorporated in, an electronic device, for example, a computer, a laptop, a network entity, etc.
The device 100 is configured to obtain data 110 including topology information 111. The topology information 111 is indicative of a plurality of entities of the communication network 1. The plurality of entities of the communication network 1 may comprise physical entities and/or logical entities, without limiting the present disclosure in that regard. The topology information 111 is further indicative of one or more interactions between some or all of the plurality of entities. The interactions may comprise any interaction that occurs between the entities of the communication network 1. For example, the device 100 may obtain the data 110, and may further parse each interaction that occurred in the communication network 1.
The device 100 is further configured to obtain a plurality of alarms 121, 122, 123. Each alarm 121, 122, 123 is associated with at least one of the plurality of entities. For example, an entity of the communication network may raise an alarm (e.g., the alarm may be raised when the communication network behaves differently than expected). Moreover, the raised alarm may be associated with the entity that raised the alarm. In some embodiments, a raised alarm may be associated with more than one entity. For example, a raised alarm may be associated, directly, with an entity that raised the alarm, and it may further be associated, indirectly, with an entity that did not raise the alarm.
The device 100 is further configured to correlate the plurality of alarms 121, 122, 123 into one or more groups of alarms 131, 132. Moreover, each group of alarms 131, 132 may be associated with a subset of the data that includes a subset of the topology information.
Furthermore, the device 100 is configured to identify one or more incidents 141 from the one or more groups of alarms 131, 132, based on an estimation of a root cause probability for each group of alarms 131, 132 according to its associated subset of the data.
For example, the alarms 122 and 123 may be correlated to one group of alarms 132. Moreover, the alarm 122 may be associated with a subset of data from the obtained data 110 that is affected by the alarm 122. Furthermore, the alarm 123 may be associated with a subset of data from the obtained data 110 that is affected by the alarm 123. Moreover, the subset of data including the topology information associated with the alarms 122 and 123 may be used for the estimation of the root cause probability of the group of alarms 132.
The estimation of the root cause probability of a given group of alarms 131, 132 represents a likelihood that an entity associated with an alarm in that given group of alarms is a root cause of an incident in the communication network 1.
Hence, the device 100 may be able to identify an incident from the one or more groups of alarms 131, 132. For example, the device 100 may optionally have a decision unit 140, which may identify the incident.
The device 100 may comprise processing circuitry (not shown in FIG. 1) configured to perform, conduct or initiate the various operations of the device 100 described herein. The processing circuitry may comprise hardware and software. The hardware may comprise analog circuitry or digital circuitry, or both analog and digital circuitry. The digital circuitry may comprise components such as application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), or multi-purpose processors. In one embodiment, the processing circuitry comprises one or more processors and a non-transitory memory connected to the one or more processors. The non-transitory memory may carry executable program code which, when executed by the one or more processors, causes the device 100 to perform, conduct or initiate the operations or methods described herein.
FIG. 2 depicts a schematic view of a flowchart of a procedure 200 for identifying an incident 141 based on an estimation of a root cause probability.
In FIG. 2, it is assumed that the plurality of entities of the communication network are based on nodes that are forming the communication network, and each node of the communication network has a name (e.g., an identifier). Further, the interactions between the nodes are performed via interfaces that are referred to as edges, without limiting the present disclosure in that regard.
At step S201, the device 100 obtains data 110 and parses interactions between the plurality of entities.
For example, the nodes of the communication network 1 interact with each other. The interactions may comprise sending or receiving a request in the communication network 1. Further, the device may parse each interaction event that happens in the communication network 1. Parsing the interactions may comprise detecting the nodes (e.g., deriving names of nodes involved in the interactions) that are the originator and the receiver of a request.
Furthermore, each request may be made via an interface, and there may be multiple interfaces between the same two nodes. The interfaces may comprise any known interface, and may have a name (e.g., an identifier such as i1, i2, etc.). Moreover, the device 100 may use the names of the nodes and the names of the interfaces to build a topology of the communication network (e.g., at step S202). For example, the topology may be a graph-representation of the communication network, in which the interfaces (e.g., the names of the interfaces) are used as the edges, in order to represent the interactions between the nodes.
At step S202, the device 100 obtains hierarchical-structured data 310. For example, the hierarchical-structured data 310 is graph-structured data that is referred to as G in FIG. 2. G may be a master graph or a master topology that indicates the overall topology of the communication network 1.
For instance, the device 100 may initialize a procedure comprising producing an empty graph. Further, as new interaction events are processed during step S201, the device 100 adds the names of the nodes and the names of the interfaces as edges to the graph G. For example, the device 100 may build a representation of the topology of the communication network based on the interactions observed in the event logs across the monitored period. The representation of the topology may be static and may not consider the temporality aspects or the time information of the occurred interactions in the graph G.
Moreover, as the interactions generally occur via named interfaces, there may be multiple directional connections from a first entity (e.g., a node A) to a second entity (e.g., a node B), and each of these directional connections may be assigned a different name.
A directional connection is shown as an arrow having a direction and being outbound from the first entity and being inbound to the second entity.
Hence, the device 100 obtains the hierarchical-structured data 310 that may be referred to as master graph or master topology graph G, which is a multi-directional graph representing the overall topology of the communication network. An example of graph G is depicted in FIG. 3.
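As an illustration of steps S201 and S202, the following is a minimal sketch of how a static multi-directional graph could be accumulated from parsed interaction events. The event layout (originator, receiver, interface name) and the function name build_master_graph are assumptions made for this sketch only and are not taken from the disclosure.

```python
from collections import defaultdict


def build_master_graph(events):
    """Accumulate a static multi-directional graph from interaction events.

    Each event is assumed to be an (originator, receiver, interface) tuple;
    this layout is hypothetical and chosen only for the sketch.
    """
    # graph[src][dst] holds the set of interface names seen from src to dst
    graph = defaultdict(lambda: defaultdict(set))
    nodes = set()
    for src, dst, interface in events:
        nodes.update((src, dst))
        # Time information is deliberately ignored: the graph is static.
        graph[src][dst].add(interface)
    return nodes, graph


events = [
    ("A", "B", "i1"),
    ("A", "B", "i2"),  # a second named interface between the same two nodes
    ("B", "C", "i1"),
    ("A", "B", "i1"),  # a repeated interaction adds no new edge
]
nodes, G = build_master_graph(events)
print(sorted(nodes), sorted(G["A"]["B"]))
```

Storing a set of interface names per ordered node pair keeps multiple named directional connections between the same two nodes distinct, while repeated interactions over the same interface collapse into one edge.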
Reference is now made to FIG. 3, which is a schematic view of an example of obtained hierarchical-structured data 310.
As discussed, the device 100 obtains hierarchical-structured data 310 based on the obtained data 110. The hierarchical-structured data 310 shown in FIG. 3 is indicative of the overall topology of the communication network. For example, the device may perform a mapping procedure for obtaining the hierarchical-structured data 310.
Next, an exemplary mapping procedure is discussed for obtaining the hierarchical-structured data that is a representation of the communication network. The central node in FIG. 3 acts as a root entity for the whole communication network. The plurality of entities of the communication network are based on a root entity, external entities, and internal entities. An external entity is an entity that requests resources. An internal entity is an entity that provides resources. Further, the root entity (the central node shown in FIG. 3) represents the main contact point for external entities that need access to resources managed by the internal entities, depicted as outbound arrows (radiating out) from the root node. In the hierarchical-structured data 310, each arrow (line or link) depicts a dependency: the arrow points out from the dependent entity to another entity, which fulfills requests on behalf of its parent entity (the entity that initiated a request for the resources). Examples of such architectures are content-delivery networks, micro-service platforms, web-services, basic routing networks, etc.
At step S203, the device 100 correlates the plurality of alarms 121, 122, 123 into groups of alarms 131, 132. For example, the plurality of alarms may be grouped based on temporal information (e.g., using timestamps) and topological information (e.g., the hierarchical-structured data 310 constructed from the interactions between the entities at step S202).
Each group of alarms may affect a subset of the hierarchical-structured data 310 (i.e., a subset of the whole graph shown in FIG. 3), and the device 100 uses that affected subset of the hierarchical-structured data (hereinafter referred to as “the sub-graph”) to estimate the root cause probability for that group of alarms.
Furthermore, depending on the estimated root cause probability, the group of alarms may be identified as an incident in the communication network 1. In some embodiments, the device 100 may further determine the root cause of the identified incident.
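The disclosure does not fix a particular grouping rule for step S203. Purely as an illustration, the sketch below groups an alarm with an existing group when it arrives within a time gap of the group's latest alarm and its entity equals, or is directly linked to, an entity already in the group. The threshold max_gap, the adjacency test, and all names are assumptions of this sketch.

```python
def correlate_alarms(alarms, neighbors, max_gap=60.0):
    """Group alarms that are close both in time and in topology.

    alarms: iterable of (timestamp_seconds, entity_name) tuples.
    neighbors: dict mapping an entity to the set of entities it links to
               in the master graph (direction is ignored in this sketch).
    max_gap: hypothetical time threshold, not taken from the disclosure.
    """
    groups = []  # each group: {"entities": set, "alarms": list, "last": float}
    for ts, entity in sorted(alarms):
        for group in groups:
            close_in_time = ts - group["last"] <= max_gap
            close_in_topology = any(
                entity == e
                or entity in neighbors.get(e, set())
                or e in neighbors.get(entity, set())
                for e in group["entities"]
            )
            if close_in_time and close_in_topology:
                group["entities"].add(entity)
                group["alarms"].append((ts, entity))
                group["last"] = ts
                break
        else:
            # No existing group fits, so the alarm starts a new group.
            groups.append({"entities": {entity}, "alarms": [(ts, entity)], "last": ts})
    return groups


neighbors = {"A": {"B"}, "B": {"C"}}
alarms = [(0, "A"), (10, "B"), (20, "X"), (500, "A")]
groups = correlate_alarms(alarms, neighbors)
print(len(groups), groups[0]["entities"])
```

In this toy input, the alarm on the unlinked entity X and the late alarm on A each start a new group, so three groups result.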
At step S204, the device 100 builds a sub-graph (referred to as G’ in FIG. 2). For example, the sub-graph G’ is a subset of the hierarchical-structured data 310 that is affected by an alarm.
In some embodiments, the step S204 may be invoked each time a new group of alarms is created; for example, each time that a new group of alarms is created at step S203, the device 100 invokes step S204 to build a sub-graph for that newly created group of alarms. The device 100 may parse the alarms and may extract the timestamp and the entity (i.e., the name of the entity) from each alarm in the incoming stream of alarms. The obtained alarms may indicate the names of the entities from which the alarms originated, or alternatively, the device 100 may use natural text analytics to match the alarms’ context to one of the entities of the communication network (e.g., to one of the entities in the graph G, or n_i ∈ G).
Furthermore, once a match is established, the matching entity may be appended to a list of matching entities, named L. The device 100 may also store the number of times the same entity is matched to one of the alarms. Moreover, the device 100 may obtain (by parsing all alarms in the groups of alarms) a list of all entities and their corresponding alarm count, and timestamps of raising each alarm.
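A minimal sketch of the bookkeeping described above is given below, assuming that each alarm already carries an entity name; the natural text analytics fallback is out of scope here. The function name match_alarms and the (timestamp, entity) alarm layout are hypothetical.

```python
from collections import defaultdict


def match_alarms(alarms, known_entities):
    """Collect, per matched entity, the timestamps of its alarms.

    alarms: iterable of (timestamp, entity_name); layout assumed for the sketch.
    known_entities: names of the entities present in the master graph G.
    Returns (matches, unmatched), where matches maps entity -> list of timestamps.
    """
    matches = defaultdict(list)
    unmatched = []
    for ts, entity in alarms:
        if entity in known_entities:
            matches[entity].append(ts)
        else:
            unmatched.append((ts, entity))
    return dict(matches), unmatched


alarms = [(1.0, "tm-behavior"), (2.0, "tm-behavior"), (3.0, "ESGE"), (4.0, "unknown-svc")]
matches, unmatched = match_alarms(alarms, {"tm-behavior", "ESGE", "tm-search"})
alarm_counts = {entity: len(stamps) for entity, stamps in matches.items()}
print(alarm_counts, unmatched)
```

The alarm count per entity is simply the length of its timestamp list, so one structure serves both the count and the timestamps mentioned in the text.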
In some embodiments, the groups of alarms 131, 132 may be updated upon the receipt of new alarms, and the step S204 may be triggered during each updating process.
Reference is now made to FIG. 4, which is a schematic view of an example of a subset of the hierarchical-structured data 410.
The subset of the hierarchical-structured data 410 shown in FIG. 4 is a subgraph of the hierarchical-structured data 310 and is obtained after correlating the alarms and obtaining the group of alarms.
For example, the device 100 obtains the subset of the hierarchical-structured data 410, or the sub-graph shown in FIG. 4, by constructing a new graph, named G’, based on the list of matched entities to an alarm, and by using the hierarchical-structured data 310 (master topology graph G) to extract the links and their directions. The resulting sub-graph G’ embeds the entities currently affected by the alarms, and the magnitude of the impact which is represented by the number of alarms. For example, in the sub-graph G’, “tm-behavior#4” represents four alarms that are associated with the entity having the name “tm-behavior”.
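The construction of G’ can be sketched as an induced sub-graph over the alarmed entities, with each entity classified by its links inside G’. The entity names below are borrowed from FIG. 4, but the links between them are invented for this example, and the function name build_subgraph is hypothetical.

```python
def build_subgraph(master_edges, alarm_counts):
    """Induce sub-graph G' on the alarmed entities and classify each entity.

    master_edges: set of (src, dst) dependency links taken from master graph G.
    alarm_counts: {entity: number_of_alarms} for the group being analysed.
    """
    nodes = set(alarm_counts)
    # Keep only links whose two endpoints both raised alarms.
    edges = {(s, d) for s, d in master_edges if s in nodes and d in nodes}
    has_out = {s for s, _ in edges}
    has_in = {d for _, d in edges}
    roles = {}
    for n in nodes:
        if n in has_out and n in has_in:
            roles[n] = "intermediary"
        elif n in has_out:
            roles[n] = "root"
        elif n in has_in:
            roles[n] = "leaf"
        else:
            roles[n] = "single"  # unconnected entity
    labels = {n: f"{n}#{alarm_counts[n]}" for n in nodes}
    return edges, roles, labels


# Links are illustrative only; they are not read from FIG. 4.
master_edges = {
    ("tm-behavior", "tm-search"),
    ("tm-search", "ESGE"),
    ("tm-search", "tm-operation"),
    ("tm-behavior", "tm-topic"),
}
counts = {"tm-behavior": 4, "tm-search": 6, "ESGE": 2, "tm-operation": 6, "tm-order": 5}
edges, roles, labels = build_subgraph(master_edges, counts)
print(roles["tm-order"], labels["tm-behavior"])
```

The link to "tm-topic" is dropped because that entity raised no alarm in this group, which is the situation the augmentation of missing entities is meant to address.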
The sub-graph G’ also depicts the hierarchy of how the incident’s alarms propagated in the communication network. In other words, a given cause propagates from the origins (leaf nodes 412) upstream, and affects other nodes which rely on the services of the leaf nodes (root nodes 411 and intermediary nodes 413). Moreover, in some embodiments, there may be matched entities that are not connected with other matching entities; these are referred to as single entities and are considered as roots in the sub-graph’s hierarchy. For instance, in the hierarchical-structured data 410, the entity having the name of “tm-order#5” represents an extracted node name that does not match any node in the graph G, and has no contribution to the root cause analysis of the entities.
The subgraph shown in FIG. 4 is a subset of the hierarchical-structured data 410 that includes two first entities 411, referred to as “tm-behavior#4” (shown as a root entity having hierarchical dependency relationship to other entities) and “tm-order#5”. The hierarchical-structured data 410 of FIG. 4 further includes two second entities 412 (each shown as a leaf having no hierarchical dependency relationship to another entity) referred to as “ESGE#2” and “tm-operation#6”, and one third entity 413 shown as an intermediary entity and referred to as “tm-search#6”.
At step S205, the device 100 estimates, for each of the time intervals, an interval probability (also referred to as leaf probability in FIG. 2).
For example, the device 100 may determine the time intervals at which the plurality of alarms 121, 122, 123 are generated. Moreover, for each of the time intervals, the device 100 may estimate its respective interval probability by determining the possible paths connecting the second entity 412 and the first entity 411 in the subset of the hierarchical-structured data 410. Furthermore, the device 100 may determine the number of alarms that are associated with the first entity 411, the second entity 412, and the third entity 413 in each possible path.
Reference is now made to FIG. 5A and FIG. 5B, which are schematic views of a numerical example for obtaining the interval probabilities (shown in diagram 500A of FIG. 5A) for the subgraph (the subset of hierarchical-structured data 410) shown in FIG. 5B.
The procedure illustrated in diagram 500A of FIG. 5A may be performed by the device 100 at step S205 of the procedure 200 shown in FIG. 2.
For example, the device 100 may repeat this process for each of the time intervals into which the group of alarms is split. That means, with each new alarm received by the communication network, the representation of the communication network in the hierarchical-structured data 310 or the subset of hierarchical-structured data 410 (the sub-graph) may be updated with the entity that is associated with the received alarm, and the following procedure may be repeated.
The sub-graph G’ is a subset of hierarchical-structured data 410 that comprises the entities 411, 412, 413, their multi-directional links, and the number of associated alarms. For example, the entity 411 in the sub-graph 410 has a name of “tm-apigw” and a number of 138 alarms are associated with the entity 411.
The sub-graph G’ 410 shown in FIG. 5B may be parsed by the device 100, as follows:
• First, the device 100 may determine, for each second entity 412 (leaf entity), all the possible paths between that leaf entity and all the first entities 411 (root entities) of the subgraph 410.
• Second, the device 100 may pass each entity in each possible path, and may accumulate the number of alarms that are associated with that passed entity.
• Third, the device 100 may add, only once, the number of alarms associated with the current second entity 412 (leaf node) for which the interval probability (leaf probability) is being calculated.
The above process may yield a number of associated alarms for each second entity 412 (leaf entity). This sum is proportional to the impact of each leaf entity on the overall incident. As can be taken from FIG. 5A and FIG. 5B, the entity referred to as “tm-odp#7” is affected by 76 alarms and the entity referred to as “tm-recommend#18” is affected by 302 alarms. Further, each leaf entity’s sum can be normalized by the grand total sum across all leaf entities, and that may yield a probability estimation for the root cause probability of the entity and/or the root cause probability of the group of alarms. The root cause probability of the entity referred to as “tm-odp#7” is 0.2 (i.e., 76/(76+302)), and the root cause probability of the entity referred to as “tm-recommend#18” is 0.8 (i.e., 302/(76+302)).
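The three parsing steps and the normalization can be sketched as follows. This reflects one reading of the text, namely that the alarms of every non-leaf entity are accumulated once per path while the leaf’s own alarms are added once in total. The toy graph with entities R, M, L1 and L2 is hypothetical; only the sums 76 and 302 in the last lines are taken from the numerical example above.

```python
def all_paths(edges, start, goal, path=None):
    """Yield every simple directed path from start to goal."""
    path = (path or []) + [start]
    if start == goal:
        yield path
        return
    for src, dst in edges:
        if src == start and dst not in path:  # skip nodes already on the path
            yield from all_paths(edges, dst, goal, path)


def leaf_sums(edges, counts, roots, leaves):
    """Accumulate the alarm sum of each leaf over all root-to-leaf paths."""
    sums = {}
    for leaf in leaves:
        total = counts[leaf]  # the leaf's own alarms are added only once
        for root in roots:
            for path in all_paths(edges, root, leaf):
                # alarms of every entity passed on the path, leaf excluded
                total += sum(counts[n] for n in path[:-1])
        sums[leaf] = total
    return sums


def normalize(sums):
    """Divide each leaf sum by the grand total across all leaves."""
    grand_total = sum(sums.values())
    return {leaf: s / grand_total for leaf, s in sums.items()}


# Hypothetical toy sub-graph: R -> M -> L1 and R -> L2.
edges = [("R", "M"), ("M", "L1"), ("R", "L2")]
counts = {"R": 3, "M": 2, "L1": 4, "L2": 1}
sums = leaf_sums(edges, counts, roots=["R"], leaves=["L1", "L2"])
print(sums, normalize(sums))

# Leaf sums quoted in the numerical example of FIG. 5A.
print(normalize({"tm-odp": 76, "tm-recommend": 302}))
```

Path enumeration is exponential in the worst case, which is acceptable for the small sub-graphs of a single incident but would need care on large topologies.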
In some embodiments, the device 100 may further perform a procedure for an augmentation of missing entities.
For example, in some embodiments, it may be possible that an actual root cause of an incident has not yet raised any alarm. Furthermore, the device 100 may use the dynamic topology of historical interactions (captured in the hierarchical-structured data or G) to augment the incident graph (the subset of hierarchical-structured data or G’), which is a sub-graph of the whole entity graph G. For example, the device 100 may search the larger graph for determining connections between leaves in G' that include nodes that did not raise an alarm.
An example is shown in diagram 600A of FIG. 6A and diagram 600B of FIG. 6B, illustrating connected leaf entities (FIG. 6A) and a single entity and leaf connected (FIG. 6B).
In diagram 600A of FIG. 6A, the entity referred to as “tm-search#2” 611 has dependency connections to the entity referred to as “tm-order#1” 612. Hence, any alarm raised by the entity “tm-order#1” 612 may potentially affect the entity “tm-search#2” 611. Therefore, the “tm-order#1” entity’s affected alarm count will include paths from the root to it that traverse the “tm-search#2” entity 611 as well. In the example shown in diagram 600A of FIG. 6A, the entity referred to as “tm-order#1” 612 is suggested as the root cause of an incident.
Moreover, in diagram 600B of FIG. 6B, the entity referred to as “tm-search#6” 611 has no dependency connections to the entity referred to as “tm-order#5” 612. In this case, the connection discussed with respect to FIG. 6A cannot be considered. Therefore, without considering the connection from the “tm-search#6” entity to the “tm-topic#0” entity and further to the “tm-order#1” entity, the entity referred to as “tm-search#6” may be identified as the root cause of the incident.
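One possible way to look for such missing connections is a breadth-first search in the master graph G between two alarmed entities, allowing only silent (non-alarmed) entities in between. The sketch below is an assumption about how the augmentation could be realized; the function name find_hidden_link, the adjacency layout, and the link direction in the example are hypothetical.

```python
from collections import deque


def find_hidden_link(master_adj, src, dst, alarmed):
    """Return the shortest directed path src -> dst in G whose inner nodes
    raised no alarm, or None if there is no such path.

    master_adj: dict mapping an entity to an iterable of its successors in G.
    alarmed: set of entities that raised alarms (the nodes of G').
    """
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        for nxt in master_adj.get(path[-1], ()):
            if nxt == dst:
                return path + [nxt]
            if nxt in seen or nxt in alarmed:
                continue  # only silent entities may sit inside the path
            seen.add(nxt)
            queue.append(path + [nxt])
    return None


# Illustrative topology: tm-search -> tm-topic -> tm-order, tm-topic is silent.
master_adj = {"tm-search": ["tm-topic"], "tm-topic": ["tm-order"]}
alarmed = {"tm-search", "tm-order"}
print(find_hidden_link(master_adj, "tm-search", "tm-order", alarmed))
```

When a path is found, its silent inner entities (here "tm-topic", with zero alarms) could be added to G’ so that the two alarmed entities become connected before the leaf probabilities are computed.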
At step S206 of the procedure 200 shown in FIG. 2, the device 100 may determine a temporal weighting function, and further apply the determined temporal weighting function to the estimation of the root cause probability of a group of alarms.
For example, in some embodiments, alarms may be generated at different times. Moreover, the device 100, when correlating the plurality of alarms, for obtaining groups of alarms, may add these alarms either to an existing incident, or to a new incident. Moreover, in the case that a new alarm is added to an existing incident, the root cause probabilities may further be adjusted.
For instance, in some embodiments, the device 100 may further obtain a state transition matrix (S.T.M.) comprising the states of the generated incidents in the communication network 1. Moreover, when a new alarm is aggregated to an existing incident, a new state of the existing incident is generated, and may be included in the state transition matrix (referred to as S.T.M. in FIG. 2). Furthermore, for each new state, the device 100 may compute the root cause probabilities, and may further compute a timeline of probabilities. Finally, the device 100 may apply the temporal weighting function to this timeline of probabilities.
In some embodiments, the device 100 may learn (e.g., it may derive) the temporal weighting function from historical prior interactions (e.g., typical time information between raised alarms, typical duration time between the first and the last alarm in an incident, etc.). In one implementation, an exponential weighting function may be used; however, generally, the device 100 may use any weighting function.
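A minimal sketch of an exponential temporal weighting applied to a timeline of probabilities is shown below. The decay rate, the choice to weight more recent states more heavily, and the final renormalization are assumptions of this sketch; the disclosure only states that an exponential weighting function may be used.

```python
import math


def weighted_root_cause(timeline, now, decay=0.01):
    """Combine per-state root cause probabilities with exponential weights.

    timeline: list of (timestamp, {entity: probability}), one entry per
              incident state; this layout is assumed for the sketch.
    decay: hypothetical decay rate per second; 0 means no weighting.
    """
    scores = {}
    for ts, probs in timeline:
        weight = math.exp(-decay * (now - ts))  # older states weigh less
        for entity, p in probs.items():
            scores[entity] = scores.get(entity, 0.0) + weight * p
    total = sum(scores.values())
    return {entity: s / total for entity, s in scores.items()}


timeline = [
    (0.0, {"a": 1.0}),              # first state: only entity "a" is suspected
    (100.0, {"a": 0.5, "b": 0.5}),  # second state, after a new alarm arrived
]
print(weighted_root_cause(timeline, now=100.0, decay=0.0))   # without weighting
print(weighted_root_cause(timeline, now=100.0, decay=0.01))  # with weighting
```

Setting decay to zero reproduces a plain average over the states, which gives a baseline against which the weighted estimate can be compared.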
FIG. 7 shows a diagram 700 illustrating an example of applying a temporal weighting function to the estimation of the root cause probability of a group of alarms.
As can be derived from diagram 700 of FIG. 7, for each new state, the device 100 recomputes (estimates again) the root cause probabilities (the estimated probabilities indicated with “Without weighting”), which are further weighted according to the time at which they were generated (the estimated probabilities indicated with “With weighting”).
Furthermore, in some embodiments, the state of entities may evolve over time, and consequently, the root cause probability of each entity may also change with each new state or a new piece of information.
At step S207 of the procedure 200 shown in FIG. 2, the device 100 estimates the root cause probability for each group of alarms 131, 132 and further identifies the incident 141.
The incident 141 is a group of alarms, for which the estimated root cause probability has the highest value.
Moreover, the device 100 may use priority information related to an entity for estimating the root cause probability for that entity in a current group of alarms, or in a current state, to be the root cause of an incident. The priority information may be an estimated prior root cause probability. For example, in some embodiments, the obtained data 110 can be collected over time about the root cause probability that an entity is the root cause of an incident. This information may also be collected from a domain expert. Moreover, the device 100 may augment the root cause probability of an entity based on the current state of the communication network 1 with the a-priori root cause probabilities using a probability boosting function. In some embodiments, the device 100 may further train the probability boosting function, when a ground truth is collected about past incidents.
This can also be applied (e.g., by the device 100) as a boosting factor onto the current transition matrix probabilities. For example, it can be applied onto each entity that has historical information (e.g., has a root cause probability) for being a root cause of an incident in the communication network. Moreover, this information may be obtained as independent and/or inter-dependent knowledge.
An example for the independent knowledge case is when the rate of failure of an entity is captured across all failures encountered. Moreover, an example for the inter-dependent knowledge case is when knowledge is captured about the probability of an entity being the root cause, given that it is grouped with certain other nodes in the same incident.
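The disclosure leaves the form of the probability boosting function open. As an illustration of the independent knowledge case only, the sketch below scales each current probability by a factor that grows with the entity’s prior root cause probability and then renormalizes. The multiplicative form, the parameter alpha, and the function name boost are assumptions of this sketch.

```python
def boost(current, priors, alpha=1.0):
    """Boost current root cause probabilities with prior probabilities.

    current: {entity: probability estimated from the current state}.
    priors: {entity: historical probability of being a root cause};
            entities without historical information are left unboosted.
    alpha: hypothetical strength of the boost; 0 disables it.
    """
    boosted = {
        entity: p * (1.0 + alpha * priors.get(entity, 0.0))
        for entity, p in current.items()
    }
    total = sum(boosted.values())
    return {entity: value / total for entity, value in boosted.items()}


current = {"a": 0.5, "b": 0.5}
priors = {"a": 0.5}  # only entity "a" has historical information
print(boost(current, priors))
```

With equal current probabilities, the entity that has a non-zero prior ends up ranked first, while the renormalization keeps the result a probability distribution.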
Reference is now made to FIG. 8, which is a schematic view of a diagram illustrating the device 100 identifying an incident in an incident management system.
In diagram 800 of FIG. 8, the device 100 is, as an example, integrated within a communication network that is an incident management system. The diagram of FIG. 8 represents the integration of the device 100 into the incident management system and the visualization of the information made available by the device 100 to a Site Reliability Engineer, who may use its output.
For example, the device 100 obtains data 110. The data 110 may comprise incident lists, time range intervals at which an incident is active, and a number of alarms associated with the incidents. Moreover, upon selection of one or more alarms, the device 100 obtains information related to the alarms, including the entities associated to the alarms. This information may be represented in a view 810.
Furthermore, the device 100 may obtain sub-graph G' 410. The sub-graph G' 410 may be obtained based on the alarms of the selected incident(s). The sub-graph G' 410 further shows the number of alarms associated with the entities, and, in case of the leaf node, the root cause probability for that leaf to be the root cause of the incident.
The device 100 may further identify an incident 141. As an example, for the sake of simplicity, a bar graph is presented that shows the entities sorted by their respective root cause probability of being the root cause of the incident. This exemplary view enhances the human understanding of the scale of the difference between the different proposed root cause entities.
FIG. 9 shows a method 900 for monitoring a communication network according to an embodiment of the disclosure. The method 900 may be carried out by the device 100, as it is described above.
The method 900 comprises a step S901 of obtaining data 110 including topology information 111, wherein the topology information 111 is indicative of a plurality of entities of the communication network 1 and one or more interactions between some or all of the plurality of entities.
The method 900 further comprises a step S902 of obtaining a plurality of alarms 121, 122, 123, wherein each alarm 121, 122, 123 is associated with at least one of the plurality of entities.
The method 900 further comprises a step S903 of correlating the plurality of alarms 121, 122, 123 into one or more groups of alarms 131, 132, wherein each group of alarms 131, 132 is associated with a subset of the data that includes a subset of the topology information.
The method 900 further comprises a step S904 of identifying one or more incidents 141 from the one or more groups of alarms 131, 132, based on an estimation of a root cause probability for each group of alarms 131, 132 according to its associated subset of the data. For example, the estimation of the root cause probability of a given group of alarms 131, 132 represents a likelihood that an entity associated with an alarm in that given group of alarms is a root cause of an incident in the communication network 1.
The present disclosure has been described in conjunction with various embodiments as examples as well as implementations. However, other variations can be understood and effected by those persons skilled in the art and practicing the claimed disclosure, from the studies of the drawings, this disclosure and the independent claims. In the claims as well as in the description the word “comprising” does not exclude other elements or steps and the indefinite article “a” or “an” does not exclude a plurality. A single element or other unit may fulfill the functions of several entities or items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used in an advantageous implementation.

Claims

1. A device (100) for monitoring a communication network (1), the device (100) being configured to: obtain data (110) including topology information (111), wherein the topology information (111) is indicative of a plurality of entities of the communication network (1) and one or more interactions between some or all of the plurality of entities; obtain a plurality of alarms (121, 122, 123), wherein each alarm (121, 122, 123) is associated with at least one of the plurality of entities; correlate the plurality of alarms (121, 122, 123) into one or more groups of alarms (131, 132), wherein each group of alarms (131, 132) is associated with a subset of the data that includes a subset of the topology information; and identify one or more incidents (141) from the one or more groups of alarms (131, 132), based on an estimation of a root cause probability for each group of alarms (131, 132) according to its associated subset of the data.
2. The device (100) according to claim 1, further configured to: determine one or more subsets of the data based on the one or more groups of alarms (131, 132) and the obtained data (110).
3. The device (100) according to claim 1 or 2, further configured to: obtain hierarchical-structured data (310) based on the obtained data (110) according to one or more criteria, wherein the plurality of entities in the hierarchical-structured data (310) have hierarchical dependency relationships, wherein the hierarchical-structured data (310) comprises a plurality of links, and wherein each link represents one or more hierarchical dependency relationships between the plurality of entities of the communication network (1).
4. The device (100) according to claim 3, further configured to: obtain one or more subsets (410) of the hierarchical-structured data (310), based on the hierarchical-structured data (310) and the one or more groups of alarms (131, 132), wherein each subset (410) of the hierarchical-structured data (310) comprises a first entity (411) having at least one hierarchical dependency relationship to at least one other entity, a second entity (412) having no hierarchical dependency relationship to another entity, and a third entity (413) located between the first entity and the second entity.
5. The device (100) according to claim 4, further configured to: determine a number of alarms associated with each of the first entity (411), the second entity (412), and the third entity (413) of at least one subset of the hierarchical-structured data, based on the one or more groups of alarms (131, 132).
6. The device (100) according to any one of the claims 1 to 5, further configured to: obtain an additional alarm associated with at least one of the plurality of entities; and add the additional alarm to a group from the one or more groups of alarms or to a new group.
7. The device (100) according to claim 6, further configured to: adjust, when adding the additional alarm to the group from the one or more groups of alarms, the estimation of the root cause probability for that group.
8. The device (100) according to one of the claims 1 to 7, further configured to: correlate the plurality of alarms (121, 122, 123) into the one or more groups of alarms (131, 132) based further on temporal information (111) of the communication network (1).
9. The device (100) according to claim 8, further configured to: determine, based on the temporal information (111) of the communication network (1), one or more time intervals, at which the plurality of alarms (121, 122, 123) are generated.
10. The device (100) according to claim 9, when depending on claim 4, further configured to estimate, for each of the time intervals, an interval probability by: determining one or more possible paths connecting the second entity (412) and the first entity (411) in the subset (410) of the hierarchical-structured data (310); and determining alarms associated with entities arranged in the one or more possible paths.
11. The device (100) according to claim 10, when depending on claim 4, further configured to: estimate the root cause probability for each group (131, 132) associated to a subset (410) of the hierarchical-structured data (310) based on the determined interval probabilities of that subset (410) of the hierarchical-structured data (310).
12. The device (100) according to one of the claims 8 to 11, when depending on claim 4, further configured to: estimate, based on the temporal information of the communication network (1), a prior root cause probability for at least one entity in at least one subset (410) of the hierarchical-structured data (310); and determine, based on the estimated prior root cause probability, a root cause probability for the at least one entity in a current group to be a root cause of an incident.
13. The device (100) according to one of the claims 8 to 12, further configured to: determine, based on the temporal information of the communication network (1), a temporal weighting function for the one or more groups of alarms (131, 132); and apply the determined temporal weighting function to the estimation of the root cause probability of at least one group of alarms (131, 132).
14. A method (900) for monitoring a communication network (1), the method (900) comprising: obtaining (S901) data (110) including topology information (111), wherein the topology information (111) is indicative of a plurality of entities of the communication network (1) and one or more interactions between some or all of the plurality of entities; obtaining (S902) a plurality of alarms (121, 122, 123), wherein each alarm (121, 122, 123) is associated with at least one of the plurality of entities; correlating (S903) the plurality of alarms (121, 122, 123) into one or more groups of alarms (131, 132), wherein each group of alarms (131, 132) is associated with a subset of the data that includes a subset of the topology information; and identifying (S904) one or more incidents (141) from the one or more groups of alarms (131, 132), based on an estimation of a root cause probability for each group of alarms (131, 132) according to its associated subset of the data.
15. A computer program product comprising instructions, which, when executed by a computer, cause the method (900) of claim 14 to be performed.
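
The interval-based estimation recited in claims 9 to 11 can be illustrated with a minimal Python sketch. It is not the claimed implementation. The hierarchy `PARENT`, the gap threshold `max_gap`, the scoring rule (fraction of entities on the path that raised an alarm) and the averaging over intervals are all assumptions made for illustration, as is treating the first entity as an ancestor of the second entity.

```python
# Hypothetical hierarchy: child -> parent (None marks the top-level entity).
PARENT = {
    "site": None,
    "router": "site",
    "board": "router",
    "port1": "board",
    "port2": "board",
}


def path_to_ancestor(entity, ancestor, parent=PARENT):
    """Return the entities from `entity` up to `ancestor` inclusive, or [] if unreachable."""
    path = []
    node = entity
    while node is not None:
        path.append(node)
        if node == ancestor:
            return path
        node = parent.get(node)
    return []


def split_into_intervals(alarms, max_gap):
    """Split (entity, timestamp) alarms into bursts separated by more than `max_gap`."""
    intervals = []
    for alarm in sorted(alarms, key=lambda a: a[1]):
        # Extend the current burst if this alarm is close enough to the previous one.
        if intervals and alarm[1] - intervals[-1][-1][1] <= max_gap:
            intervals[-1].append(alarm)
        else:
            intervals.append([alarm])
    return intervals


def interval_probability(interval, first_entity, second_entity, parent=PARENT):
    """Assumed scoring: share of entities on the path second -> first that raised an alarm."""
    path = path_to_ancestor(second_entity, first_entity, parent)
    if not path:
        return 0.0
    alarmed = {entity for entity, _ in interval}
    return sum(1 for entity in path if entity in alarmed) / len(path)


def group_root_cause_probability(alarms, first_entity, second_entity, max_gap=60):
    """Assumed aggregation: mean of the interval probabilities of one group."""
    intervals = split_into_intervals(alarms, max_gap)
    if not intervals:
        return 0.0
    probs = [interval_probability(i, first_entity, second_entity) for i in intervals]
    return sum(probs) / len(probs)


# Example: one burst covering the whole path, then an isolated alarm much later.
alarms = [("port1", 0), ("board", 5), ("router", 10), ("port1", 300)]
score = group_root_cause_probability(alarms, "router", "port1")
```

In this example the first burst covers every entity on the path from `port1` to `router`, while the later burst covers only one of three, so the group score lies between the two interval values. A different aggregation, for instance one weighted by recency as in claim 13, could be substituted for the plain mean.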

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/EP2020/065990 WO2021249629A1 (en) 2020-06-09 2020-06-09 Device and method for monitoring communication networks

Publications (1)

Publication Number Publication Date
WO2021249629A1 (en) 2021-12-16

Family

ID=71094315

Country Status (1)

Country Link
WO (1) WO2021249629A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110141914A1 (en) * 2009-12-15 2011-06-16 Chen-Yui Yang Systems and Methods for Providing Ethernet Service Circuit Management
US20150195154A1 (en) * 2014-01-08 2015-07-09 Telefonaktiebolaget L M Ericsson (Publ) Creating a Knowledge Base for Alarm Management in a Communications Network
US20180287856A1 (en) * 2017-03-28 2018-10-04 Ca, Inc. Managing alarms from distributed applications

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JI SHUJIAN ET AL: "CMonitor: A Monitoring and Alarming Platform for Container-Based Clouds", 31 December 2019, LECTURE NOTES IN COMPUTER SCIENCE, PAGE(S) 324 - 339, ISSN: 0302-9743, XP047524935 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115174251A (en) * 2022-07-19 2022-10-11 深信服科技股份有限公司 False alarm identification method and device for safety alarm and storage medium
CN115174251B (en) * 2022-07-19 2023-09-05 深信服科技股份有限公司 False alarm identification method and device for safety alarm and storage medium

Similar Documents

Publication Publication Date Title
EP3373516B1 (en) Method and device for processing service calling information
US6604208B1 (en) Incremental alarm correlation method and apparatus
EP1695485B1 (en) Method for automatically classifying a set of alarms emitted by sensors for detecting intrusions of a information security system
EP1279211B1 (en) Topology-based reasoning apparatus for root-cause analysis of network faults
US8015139B2 (en) Inferring candidates that are potentially responsible for user-perceptible network problems
US9836952B2 (en) Alarm causality templates for network function virtualization
US11348023B2 (en) Identifying locations and causes of network faults
JP6097889B2 (en) Monitoring system, monitoring device, and inspection device
US8245079B2 (en) Correlation of network alarm messages based on alarm time
US9524223B2 (en) Performance metrics of a computer system
US11252052B1 (en) Intelligent node failure prediction and ticket triage solution
US8483091B1 (en) Automatic displaying of alarms in a communications network
CN113467421B (en) Method for acquiring micro-service health status index and micro-service abnormity diagnosis method
US20210359899A1 (en) Managing Event Data in a Network
CN112433913B (en) Transaction path generation method, system, computer device and storage medium
WO2021249629A1 (en) Device and method for monitoring communication networks
US7796500B1 (en) Automated determination of service impacting events in a communications network
Wang et al. A methodology for root-cause analysis in component based systems
US7701843B1 (en) Intelligent-topology-driven alarm placement
EP3435233B1 (en) A method for identifying causality objects
KR20190132223A (en) Apparatus and method for analyzing cause of network failure
US8015278B1 (en) Automating alarm handling in a communications network using network-generated tickets and customer-generated tickets
US7986639B1 (en) Topology management of a communications network
RU2801825C2 (en) Method, complex for processing information about failures of devices of wireless sensor networks for data transmission and related networks
CN114422324B (en) Alarm information processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application
    Ref document number: 20732809; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: DE
122 EP: PCT application non-entry in European phase
    Ref document number: 20732809; Country of ref document: EP; Kind code of ref document: A1