WO2011014169A1 - Constructing a bayesian network based on received events associated with network entities - Google Patents

Info

Publication number
WO2011014169A1
Authority
WO
WIPO (PCT)
Prior art keywords
network
events
entities
records
bayesian
Application number
PCT/US2009/052222
Other languages
French (fr)
Inventor
Rajeev Dutt
Jonathan Bradshaw
Original Assignee
Hewlett-Packard Development Company, L.P.
Application filed by Hewlett-Packard Development Company, L.P.
Priority to EP09847912.4A (EP2460105B1)
Priority to PCT/US2009/052222 (WO2011014169A1)
Priority to CN200980160660.3A (CN102640154B)
Priority to US13/384,516 (US8938406B2)
Publication of WO2011014169A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H04L41/065 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis involving logical or physical relationship, e.g. grouping and hierarchies
    • H04L41/069 Management of faults, events, alarms or notifications using logs of notifications; Post-processing of notifications
    • H04L41/14 Network analysis or design
    • H04L41/142 Network analysis or design using statistical or mathematical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00 Computing arrangements based on specific mathematical models
    • G06N7/01 Probabilistic graphical models, e.g. probabilistic networks

Definitions

  • Some network environments may maintain knowledge databases (sometimes referred to as configuration management databases) regarding the configuration of the network.
  • an administrator can consult the knowledge database to attempt to determine what impact the outage or defect would have on other parts of the network.
  • manually consulting this knowledge database to perform the diagnosis can be a time-consuming and tedious task, which may ultimately produce inaccurate results.
  • a knowledge database can become obsolete relatively quickly.
  • an automated process is provided to consult such a knowledge database to diagnose impacts of outages or defects at network entities, such automated processes may nevertheless produce inaccurate results if the knowledge database is not updated.
  • Fig. 1 is a block diagram of an exemplary arrangement that incorporates an embodiment of the invention
  • Fig. 2 is a flow diagram of a process of constructing and using a Bayesian network, according to an embodiment
  • Fig. 3 is a flow diagram of a process of mapping received records to a provided ontology, according to a further embodiment.
  • an automated learning system is provided to determine cause and effect relationships between events occurring in a network
  • network environments can include a relatively large number of network entities (which can be hardware entities, software entities, and/or combinations of hardware and software entities).
  • network entities can include computers, switches, routers, storage servers, and so forth.
  • Software entities can include software applications, web software, scripts, and so forth.
  • the automated learning system receives records of events associated with network entities in the network environment.
  • the events represented by the records are fault events that indicate something wrong has occurred at corresponding network entities.
  • the network entity may have crashed or may have produced an error that caused inaccurate outputs to be produced.
  • the events can represent other occurrences associated with the network entities. More generally, an "event" refers to an occurrence of some phenomenon, act, operation, alarm, and so forth, at or in connection with a network entity.
  • the records of the events are analyzed to identify relationships between events associated with different ones of the network entities. Each of the records of the events identifies a corresponding network entity impacted by the event.
  • the order in which the events are received is significant.
  • the event ordering can occur temporally (events received in time) or the event ordering can occur spatially (events received over a given space). In the former case, the events will indicate a causal (cause-and-effect) relationship, such as event A has a high likelihood of preceding event B. In the latter case, the events will indicate a spatial relationship, such as event A has a high likelihood of being near event B.
  • the automated learning system constructs a Bayesian network based on the analyzing.
  • the constructed Bayesian network is able to make predictions regarding relationships (e.g., causal relationships, spatial relationships, etc.) between events connected with the network elements. For example, the Bayesian network can predict events associated with some of the network entities based on detecting events at others of the network entities. As another example, the Bayesian network can diagnose a source of a problem based on detected events at one or more network entities. In addition, based on analyzing the events, the Bayesian network can be used to output a representation of the infrastructure of the network environment. This can assist administrators in maintaining updated system interconnections as changes are continually made in the network environment, which can be a tedious and time-consuming task.
  • a Bayesian network is a probabilistic structured representation of a domain to allow existing knowledge to be captured about the domain.
  • the Bayesian network is able to learn the stochastic properties of the domain (on a continual and real-time basis, for example) to update a model of the domain over time.
  • a Bayesian network has a directed acyclic graph structure, where the directed acyclic graph has nodes that represent variables from the domain, and arcs between the nodes represent dependencies between the variables.
  • the arcs of the Bayesian network also are associated with conditional probability distributions over the variables, where the conditional probability distributions encode the probability that variables assume different values given values of parent variables in the graph.
  • a Bayesian network is a graphical model for representing conditional dependencies between random variables of a domain.
  • the domain is a network environment having network entities that are associated with events, such as fault events.
  • the nodes of the Bayesian network represent corresponding network entities, and the arcs between the nodes are associated with conditional probability distributions that represent likelihoods of events associated with some of the network entities being related to events associated with others of the network entities.
  • Fig. 1 illustrates an exemplary arrangement in which some embodiments of the invention can be incorporated.
  • a network environment 102 includes various network entities 104, and possibly one or more monitoring agents 106.
  • the monitoring agents 106 can be part of the network entities 104 or separate from the network entities 104.
  • the monitoring agents 106 are used for monitoring operations of the network entities 104. Thus, any outages or defects at the network entities 104 can be detected by the monitoring agents 106.
  • the network entities 104 can be software entities, hardware entities, or combinations of software and hardware entities.
  • the monitoring agents 106 are able to create records of the events detected by the monitoring agents.
  • Fig. 1 also shows a call center 108.
  • the call center 108 can receive calls from users of the network environment 102 regarding any errors that are experienced by the users. Call agents at the call center 108 can then create records regarding the calls received about events that have occurred in the network environment 102.
  • the records generated at the call center 108 and/or the monitoring agents 106 can be sent to an analysis computer 100 over a network 110.
  • a "record" regarding an event refers to any representation of the event.
  • the record can have a predefined format, be in a predefined language, or can have any other predefined structure.
  • the record associated with a particular event identifies the network entity, such as by using a configuration identifier or some other type of identifier.
  • the records can also identify different types of events that may have occurred. For example, the records may identify different types of fault events (such as fault events that caused a network entity crash (outage), fault events that produced data error, software fault events, hardware fault events, fault events associated with defects, and so forth).
  • the records of the events are stored as events 112 in a storage media 114 in the computer 100.
  • the storage media 114 can be implemented with one or more disk-based storage devices and/or integrated circuit or semi-conductor memory devices.
  • the computer 100 includes analysis software 114 that is able to analyze the events 112 received from the call center 108 and/or monitoring agents 106.
  • the analysis software 114 is executable on one or more processors 116, which is (are) connected through a network interface 118 to the network 110 to allow the computer 100 to communicate over the network 110.
  • the computer 100 can refer to either a single computer node or to multiple computer nodes.
  • the analysis software 114 implements the automated learning system referred to above for analyzing events associated with network entities in a network environment for constructing a Bayesian network 120 that identifies relationships between the events associated with different ones of the network entities 104 in the network environment 102.
  • the constructed Bayesian network 120 is stored in the storage media 114. Although the Bayesian network 120 and analysis software 114 are shown as two separate elements, the Bayesian network 120 is part of the analysis software 114 to allow for the capture of knowledge about the network environment based on the records 112 of the events.
  • the Bayesian network 120 can continually update its model of the network environment based on continued receipt of records 112 of the events over time.
  • the analysis software 114 is able to construct inferences based on the frequency of event types and to automate the entire process from start to end.
  • the analysis software 114 looks at the propagation of fault events through the network environment 102 (as reported by the event records 112).
  • the relationships can be inferred from the frequency and occurrence of the events as detected by the call center 108 and/or by the monitoring agents 106.
  • the event records contain identifiers of corresponding network entities.
  • an ontology 122 is also created and stored in the storage media 114.
  • the ontology is a structured, machine-readable data model.
  • the ontology 122 models the concepts of the domain being analyzed, in this case the network environment 102.
  • the ontology 122 captures concepts of the domain (and relationships between the concepts) to provide a shared common understanding of the domain.
  • the ontology 122 serves as a repository of knowledge about the network environment 102 to enable the construction of the Bayesian network 120.
  • the ontology 122 provides a System class with a Components subclass that contains a simple diagnostic parameter that can take on one of the following three values: available, degraded and unavailable.
  • Each network entity can be associated with the foregoing ontology model. Depending upon the state of operation of the network entity, the network entity will be associated with the diagnostic parameter that is assigned one of the foregoing three values.
  • the value available indicates that the network entity is operating normally.
  • the value degraded indicates that the network entity has degraded performance.
  • the value unavailable indicates that the network entity is down or otherwise not available.
  • the records that are incoming can include unstructured text, which may make conforming to the given ontology relatively difficult. However, if the records are defined to have specific tags that are consistent with the ontology, then an automated process can be provided to extract information from the records according to the ontology.
  • In the process of learning the Bayesian network, analysis is performed of the frequency of the incoming events, categorized by event type, over a period of time. Based on the analyzed event records, the Bayesian network 120 is able to determine the likelihood that different events are related and also determine the type of relationship (e.g., whether it is a cause or an effect relationship). As noted above, there is an order associated with the incoming events, where the order can be a temporal order or a spatial order. A temporal ordering of the events allows for a causal relationship to be derived using the Bayesian network 120. However, a spatial ordering of the events allows for the Bayesian network 120 to learn a spatial relationship among events. In some embodiments, both temporal and spatial ordering of the events are considered in learning the Bayesian network 120.
  • the Bayesian network 120 can be used to make predictions. For example, the Bayesian network can predict if an event at network entity A will impact network entity B, or that failure at network entity D is likely caused by a failure at network entity C.
  • Fig. 2 is a flow diagram of building and using a Bayesian network, in accordance with an embodiment. The process of Fig. 2 can be performed by the analysis software 114 and Bayesian network 120 of Fig. 1.
  • a stream of records of events is received (at 202).
  • the events in some embodiments are fault events for indicating faults in the network environment 102 (Fig. 1).
  • the records can be received from monitoring agents 106 and/or the call center 108.
  • the information contained in the records of the fault events is analyzed (at 204).
  • the analysis involves looking at the propagation of faults along network entities in the network environment 102.
  • frequencies of fault events categorized by event type (e.g., different types of faults)
  • a relationship between events implies an underlying relationship between network entities that the events refer to.
  • Analyzing the frequencies of events categorized by event types allows the Bayesian network 120 to learn conditional probability distributions between fault events associated with the network entities. For example, if occurrences of fault events of a particular type at network entity A correlate frequently with fault events at network entities C and F, then the Bayesian network will reflect this relationship in the arcs connecting nodes corresponding to network entities A, C, and F.
  • the Bayesian network 120 is updated (at 206).
  • the updated Bayesian network 120 is then used (at 208) to make predictions.
  • the predictions can be as follows: if a fault event occurs at network entity A, how will that impact network entity B; if a fault event occurred at network entity D, how likely is it that this fault event was caused by a failure at network entity C.
  • the outputs of the Bayesian network 120 can also be used to discover the network infrastructure of the network environment 102. Propagation of fault events along a particular path will reveal relationships among the network entities along that path. Since the records of events contain identifiers of the network entities, this information can be leveraged to build up a representation of the network infrastructure.
  • The process of Fig. 2 can be recursively repeated to continually update the Bayesian network 120 as conditions change or as the infrastructure of the network environment 102 changes (e.g., network entities added, network entities removed, or network entities upgraded). In this manner, it is ensured that the model of the network environment 102 used is an updated representation that does not become obsolete quickly.
  • Fig. 3 is a flow diagram of a process according to a further embodiment.
  • An ontology of the domain to be modeled is provided (at 302), where the domain in this case is the network environment 102.
  • the ontology 122 provides a System class with a Components subclass that contains a simple diagnostic parameter that can take on one of the following three values: available, degraded and unavailable, as discussed above.
  • the received records of the events are mapped (at 304) to the ontology. This is to allow meaningful information that is relevant to learning the Bayesian network to be extracted. In cases where the received records contain unstructured data, pre-processing can be applied to perform the mapping. Alternatively, tag fields can be provided in the records that contain information relevant to the ontology.
  • the mapped records are provided (at 306) to the analysis software 114 and Bayesian network 120 to continue to learn the Bayesian network 120.
  • a processor includes microprocessors, microcontrollers, processor modules or subsystems (including one or more microprocessors or microcontrollers), or other control or computing devices.
  • Data and instructions (of the software) are stored in respective storage devices, which are implemented as one or more computer-readable or computer-usable storage media.
  • the storage media include different forms of memory including semiconductor memory devices such as dynamic or static random access memories (DRAMs or SRAMs), erasable and programmable read-only memories (EPROMs), electrically erasable and programmable read-only memories (EEPROMs) and flash memories; magnetic disks such as fixed, floppy and removable disks; other magnetic media including tape; and optical media such as compact disks (CDs) or digital video disks (DVDs).
  • instructions of the software discussed above can be provided on one computer-readable or computer-usable storage medium, or alternatively, can be provided on multiple computer-readable or computer-usable storage media distributed in a large system having possibly plural nodes.
  • Such computer-readable or computer-usable storage medium or media is (are) considered to be part of an article (or article of manufacture).
  • An article or article of manufacture can refer to any manufactured single component or multiple components.

Abstract

Records of events associated with network entities in a network environment are received, where the network entities are selected from hardware entities, software entities, and combinations of hardware and software entities. The records of the events are analyzed to identify relationships between events associated with different ones of the network entities, where the records of the events identify corresponding network entities impacted by the events. A Bayesian network is constructed based on the analyzing, wherein the constructed Bayesian network is able to make predictions regarding relationships between events associated with the network elements.

Description

Constructing A Bayesian Network Based On Received Events Associated With Network Entities
Background
[0001] In a network environment where there are a relatively large number of network entities that can span multiple geographic regions, it may be difficult to quickly identify the impact of an outage or defect at one or more network entities on other parts of the network.
[0002] Some network environments may maintain knowledge databases (sometimes referred to as configuration management databases) regarding the configuration of the network. In response to detected outages, an administrator can consult the knowledge database to attempt to determine what impact the outage or defect would have on other parts of the network. For a large network environment, manually consulting this knowledge database to perform the diagnosis can be a time-consuming and tedious task, which may ultimately produce inaccurate results.
[0003] Moreover, a knowledge database can become obsolete relatively quickly. Thus, even if an automated process is provided to consult such a knowledge database to diagnose impacts of outages or defects at network entities, such automated processes may nevertheless produce inaccurate results if the knowledge database is not updated.
Brief Description Of The Drawings
[0004] Some embodiments of the invention are described with respect to the following figures:
Fig. 1 is a block diagram of an exemplary arrangement that incorporates an embodiment of the invention;
Fig. 2 is a flow diagram of a process of constructing and using a Bayesian network, according to an embodiment;
Fig. 3 is a flow diagram of a process of mapping received records to a provided ontology, according to a further embodiment.
Detailed Description
[0005] In accordance with some embodiments, an automated learning system is provided to determine cause and effect relationships between events occurring in a network environment that includes network entities. Some network environments can include a relatively large number of network entities (which can be hardware entities, software entities, and/or combinations of hardware and software entities). For example, network entities can include computers, switches, routers, storage servers, and so forth. Software entities can include software applications, web software, scripts, and so forth.
[0006] The automated learning system receives records of events associated with network entities in the network environment. In some embodiments, the events represented by the records are fault events that indicate something wrong has occurred at corresponding network entities. For example, the network entity may have crashed or may have produced an error that caused inaccurate outputs to be produced. In other embodiments, the events can represent other occurrences associated with the network entities. More generally, an "event" refers to an occurrence of some phenomenon, act, operation, alarm, and so forth, at or in connection with a network entity.
[0007] The records of the events are analyzed to identify relationships between events associated with different ones of the network entities. Each of the records of the events identifies a corresponding network entity impacted by the event. The order in which the events are received is significant. The event ordering can occur temporally (events received in time) or the event ordering can occur spatially (events received over a given space). In the former case, the events will indicate a causal (cause-and-effect) relationship, such as event A has a high likelihood of preceding event B. In the latter case, the events will indicate a spatial relationship, such as event A has a high likelihood of being near event B. The automated learning system constructs a Bayesian network based on the analyzing.
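As an illustration of how a temporal ordering of events could be turned into evidence that one event tends to precede another, the following is a minimal sketch only. The function name `count_precedence`, the `(timestamp, entity_id)` tuple layout, and the fixed time window are assumptions made for the example and are not taken from the described embodiments.

```python
from collections import Counter

def count_precedence(events, window):
    """Count how often an event at one entity is followed, within
    `window` time units, by an event at a different entity.

    events: list of (timestamp, entity_id) tuples (hypothetical layout).
    Returns a Counter keyed by (earlier_entity, later_entity).
    """
    ordered = sorted(events)  # order by timestamp, then entity id
    counts = Counter()
    for i, (t_a, a) in enumerate(ordered):
        for t_b, b in ordered[i + 1:]:
            if t_b - t_a > window:
                break  # later events are even further away in time
            if a != b:
                counts[(a, b)] += 1
    return counts

# Hypothetical stream: events at entity A are twice followed closely by B.
events = [(0, "A"), (2, "B"), (50, "A"), (51, "B"), (100, "B"), (130, "A")]
counts = count_precedence(events, window=10)
```

Here a high count for the pair `("A", "B")` relative to `("B", "A")` would be read as a hint that events at A tend to precede events at B. A spatial ordering could be handled analogously by replacing the timestamp with a position and the window with a distance.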
[0008] The constructed Bayesian network is able to make predictions regarding relationships (e.g., causal relationships, spatial relationships, etc.) between events connected with the network elements. For example, the Bayesian network can predict events associated with some of the network entities based on detecting events at others of the network entities. As another example, the Bayesian network can diagnose a source of a problem based on detected events at one or more network entities. In addition, based on analyzing the events, the Bayesian network can be used to output a representation of the infrastructure of the network environment. This can assist administrators in maintaining updated system interconnections as changes are continually made in the network environment, which can be a tedious and time-consuming task.
[0009] A Bayesian network is a probabilistic structured representation of a domain to allow existing knowledge to be captured about the domain. The Bayesian network is able to learn the stochastic properties of the domain (on a continual and real-time basis, for example) to update a model of the domain over time. A Bayesian network has a directed acyclic graph structure, where the directed acyclic graph has nodes that represent variables from the domain, and arcs between the nodes represent dependencies between the variables. The arcs of the Bayesian network also are associated with conditional probability distributions over the variables, where the conditional probability distributions encode the probability that variables assume different values given values of parent variables in the graph. More generally, a Bayesian network is a graphical model for representing conditional dependencies between random variables of a domain. In accordance with some embodiments, the domain is a network environment having network entities that are associated with events, such as fault events.
[0010] In the context of representing a network environment having interconnected network entities, the nodes of the Bayesian network represent corresponding network entities, and the arcs between the nodes are associated with conditional probability distributions that represent likelihoods of events associated with some of the network entities being related to events associated with others of the network entities.
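The structure described above can be sketched in a few lines. The example below is illustrative only: the three entities, the probability values, and the names `PARENTS`, `CPT`, `joint`, and `query` are all assumptions. Each node is a binary "faulty" variable for one network entity, and the conditional probability table is stored per node and keyed by the values of its parents, which is one common way to attach the distributions to the arcs. Inference is done by brute-force enumeration, which is practical only for very small networks.

```python
from itertools import product

# Directed acyclic graph: node -> tuple of parent nodes (hypothetical entities).
PARENTS = {"A": (), "B": ("A",), "C": ("A",)}

# node -> {tuple of parent values: P(node is faulty | parents)} (made-up numbers).
CPT = {
    "A": {(): 0.1},
    "B": {(True,): 0.8, (False,): 0.05},
    "C": {(True,): 0.5, (False,): 0.01},
}
ORDER = ["A", "B", "C"]  # a topological order of the graph

def joint(assignment):
    """Probability of one full assignment of faulty/not-faulty values."""
    p = 1.0
    for node in ORDER:
        parent_vals = tuple(assignment[q] for q in PARENTS[node])
        p_fault = CPT[node][parent_vals]
        p *= p_fault if assignment[node] else 1.0 - p_fault
    return p

def query(target, evidence):
    """P(target is faulty | evidence), computed by enumeration."""
    num = den = 0.0
    for values in product([True, False], repeat=len(ORDER)):
        assignment = dict(zip(ORDER, values))
        if any(assignment[k] != v for k, v in evidence.items()):
            continue  # inconsistent with the observed events
        p = joint(assignment)
        den += p
        if assignment[target]:
            num += p
    return num / den

prediction = query("B", {"A": True})  # effect given cause
diagnosis = query("A", {"B": True})   # cause given effect
```

With these made-up numbers, `prediction` is 0.8 and `diagnosis` is 0.64, showing that the same structure supports both predicting downstream events and diagnosing a likely source.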
[0011] Fig. 1 illustrates an exemplary arrangement in which some embodiments of the invention can be incorporated. In Fig. 1, a network environment 102 includes various network entities 104, and possibly one or more monitoring agents 106. The monitoring agents 106 can be part of the network entities 104 or separate from the network entities 104. The monitoring agents 106 are used for monitoring operations of the network entities 104. Thus, any outages or defects at the network entities 104 can be detected by the monitoring agents 106. Note that the network entities 104 can be software entities, hardware entities, or combinations of software and hardware entities. The monitoring agents 106 are able to create records of the events detected by the monitoring agents.
[0012] Fig. 1 also shows a call center 108. The call center 108 can receive calls from users of the network environment 102 regarding any errors that are experienced by the users. Call agents at the call center 108 can then create records regarding the calls received about events that have occurred in the network environment 102.
[0013] The records generated at the call center 108 and/or the monitoring agents 106 can be sent to an analysis computer 100 over a network 110. A "record" regarding an event refers to any representation of the event. The record can have a predefined format, be in a predefined language, or can have any other predefined structure. The record associated with a particular event identifies the network entity, such as by using a configuration identifier or some other type of identifier. In some embodiments, the records can also identify different types of events that may have occurred. For example, the records may identify different types of fault events (such as fault events that caused a network entity crash (outage), fault events that produced data error, software fault events, hardware fault events, fault events associated with defects, and so forth).
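One possible shape for such a record is sketched below. The field names, the `FaultType` members, and the sample values are assumptions chosen to mirror the fault categories listed above; the description does not prescribe any particular format.

```python
from dataclasses import dataclass
from enum import Enum

class FaultType(Enum):
    """Hypothetical event types mirroring the categories named in the text."""
    OUTAGE = "outage"
    DATA_ERROR = "data_error"
    SOFTWARE = "software"
    HARDWARE = "hardware"
    DEFECT = "defect"

@dataclass(frozen=True)
class EventRecord:
    entity_id: str         # configuration identifier of the impacted entity
    fault_type: FaultType  # type of event that occurred
    timestamp: float       # when the event was detected or reported
    source: str            # e.g. "monitoring_agent" or "call_center"

record = EventRecord("router-17", FaultType.OUTAGE, 1000.0, "monitoring_agent")
```

Making the record immutable (`frozen=True`) is a design choice for the sketch: it lets records be used as dictionary keys or set members when counting event frequencies.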
[0014] The records of the events are stored as events 112 in a storage media 114 in the computer 100. The storage media 114 can be implemented with one or more disk-based storage devices and/or integrated circuit or semi-conductor memory devices. The computer 100 includes analysis software 114 that is able to analyze the events 112 received from the call center 108 and/or monitoring agents 106.
[0015] The analysis software 114 is executable on one or more processors 116, which is (are) connected through a network interface 118 to the network 110 to allow the computer 100 to communicate over the network 110. Although shown as a single block, it is contemplated that the computer 100 can refer to either a single computer node or to multiple computer nodes.
[0016] The analysis software 114 implements the automated learning system referred to above for analyzing events associated with network entities in a network environment for constructing a Bayesian network 120 that identifies relationships between the events associated with different ones of the network entities 104 in the network environment 102. The constructed Bayesian network 120 is stored in the storage media 114. Although the Bayesian network 120 and analysis software 114 are shown as two separate elements, the Bayesian network 120 is part of the analysis software 114 to allow for the capture of knowledge about the network environment based on the records 112 of the events. The Bayesian network 120 can continually update its model of the network environment based on continued receipt of records 112 of the events over time.
[0017] The analysis software 114 is able to construct inferences based on the frequency of event types and to automate the entire process from start to end. In some embodiments, the analysis software 114 looks at the propagation of fault events through the network environment 102 (as reported by the event records 112). The relationships can be inferred from the frequency and occurrence of the events as detected by the call center 108 and/or by the monitoring agents 106. As noted above, the event records contain identifiers of corresponding network entities.
[0018] In addition, to assist in constructing the Bayesian network 120, an ontology 122 is also created and stored in the storage media 114. The ontology is a structured, machine-readable data model. The ontology 122 models the concepts of the domain being analyzed, in this case the network environment 102. The ontology 122 captures concepts of the domain (and relationships between the concepts) to provide a shared common understanding of the domain. The ontology 122 serves as a repository of knowledge about the network environment 102 to enable the construction of the Bayesian network 120.
[0019] In some implementations, the ontology 122 provides a System class with a Components subclass that contains a simple diagnostic parameter that can take on one of the following three values: available, degraded and unavailable. Each network entity can be associated with the foregoing ontology model. Depending upon the state of operation of the network entity, the network entity will be associated with the diagnostic parameter that is assigned one of the foregoing three values. The value available indicates that the network entity is operating normally. The value degraded indicates that the network entity has degraded performance. The value unavailable indicates that the network entity is down or otherwise not available. Although a specific exemplary ontology is provided above, note that alternative implementations can employ other exemplary ontologies.
[0020] The records that are incoming can include unstructured text, which may make conforming to the given ontology relatively difficult. However, if the records are defined to have specific tags that are consistent with the ontology, then an automated process can be provided to extract information from the records according to the ontology.
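By way of illustration only, the exemplary ontology of paragraph [0019] and the tagged records of paragraph [0020] might be sketched in Python as follows. The class names System and Components and the three diagnostic values come from the description; the record layout "entity=<id>;state=<value>" and the function name parse_tagged_record are assumptions made for this sketch and are not specified in the description.

```python
from dataclasses import dataclass
from enum import Enum


class DiagnosticState(Enum):
    """The three values of the diagnostic parameter named in the description."""
    AVAILABLE = "available"      # entity is operating normally
    DEGRADED = "degraded"        # entity has degraded performance
    UNAVAILABLE = "unavailable"  # entity is down or otherwise not available


@dataclass
class System:
    """Top-level ontology class."""
    name: str


@dataclass
class Components(System):
    """Subclass that carries the diagnostic parameter for one network entity."""
    state: DiagnosticState = DiagnosticState.AVAILABLE


def parse_tagged_record(record: str) -> Components:
    """Parse a record in the hypothetical form 'entity=<id>;state=<value>'."""
    fields = dict(part.split("=", 1) for part in record.split(";"))
    # DiagnosticState(...) raises ValueError for a value outside the ontology.
    return Components(name=fields["entity"], state=DiagnosticState(fields["state"]))


router = parse_tagged_record("entity=router-7;state=degraded")
```

Because the state is restricted to an enumeration, a record whose tag does not match the ontology is rejected rather than silently accepted, which is one possible way to keep the records consistent with the ontology.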
[0021] In the process of learning the Bayesian network, analysis is performed of the frequency of the incoming events, categorized by event type, over a period of time. Based on the analyzed event records, the Bayesian network 120 is able to determine the likelihood that different events are related and also determine the type of relationship (e.g., whether it is a cause or an effect relationship). As noted above, there is an order associated with the incoming events, where the order can be a temporal order or a spatial order. A temporal ordering of the events allows for a causal relationship to be derived using the Bayesian network 120. However, a spatial ordering of the events allows for the Bayesian network 120 to learn a spatial relationship among events. In some embodiments, both temporal and spatial ordering of the events are considered in learning the Bayesian network 120.
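As a minimal sketch of how temporal ordering could suggest the direction of a relationship, the following Python example counts how often a fault at one entity is followed, within a time window, by a fault at another entity. The window size, the event tuples, and the function names are hypothetical; the description does not prescribe a particular algorithm, and temporal precedence alone only suggests, and does not prove, a causal relationship.

```python
from collections import Counter


def count_ordered_pairs(events, window):
    """Count (earlier_entity, later_entity) pairs that occur within `window`.

    `events` is a list of (timestamp, entity) tuples.
    """
    ordered = sorted(events)
    follows = Counter()
    for i, (t_first, first) in enumerate(ordered):
        for t_next, nxt in ordered[i + 1:]:
            if t_next - t_first > window:
                break  # later events are even further away in time
            if nxt != first:
                follows[(first, nxt)] += 1
    return follows


def likely_direction(follows, x, y):
    """Return the more frequently observed ordering of x and y, or None on a tie."""
    forward, backward = follows[(x, y)], follows[(y, x)]
    if forward == backward:
        return None
    return (x, y) if forward > backward else (y, x)


# Hypothetical fault events: (timestamp in seconds, entity identifier).
events = [(0, "A"), (2, "B"), (100, "A"), (103, "B"),
          (200, "B"), (201, "A"), (300, "A"), (301, "B")]
follows = count_ordered_pairs(events, window=10)
direction = likely_direction(follows, "A", "B")
```

In this invented data set a fault at A precedes a fault at B three times and follows it once, so the sketch proposes an arc from A to B.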
[0022] Once the Bayesian network 120 is trained (learned), the Bayesian network can be used to make predictions. For example, the Bayesian network can predict if an event at network entity A will impact network entity B, or that failure at network entity D is likely caused by a failure at network entity C.
[0023] Fig. 2 is a flow diagram of building and using a Bayesian network, in accordance with an embodiment. The process of Fig. 2 can be performed by the analysis software 114 and Bayesian network 120 of Fig. 1.
[0024] A stream of records of events is received (at 202). The events in some embodiments are fault events for indicating faults in the network environment 102 (Fig. 1). As noted above, the records can be received from monitoring agents 106 and/or the call center 108.
[0025] The information contained in the records of the fault events is analyzed (at 204). The analysis involves looking at the propagation of faults along network entities in the network environment 102. Frequencies of fault events categorized by event type (e.g., different types of faults) are also analyzed. Since there is a correspondence between events and network entities (as identified by configuration identifiers in the records), a relationship between events implies an underlying relationship between the network entities that the events refer to. Analyzing the frequencies of events categorized by event type allows the Bayesian network 120 to learn conditional probability distributions between fault events associated with the network entities. For example, if occurrences of fault events of a particular type at network entity A correlate frequently with fault events at network entities C and F, then the Bayesian network 120 will reflect this relationship in the arcs connecting nodes corresponding to network entities A, C, and F.
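One simple way to picture the frequency analysis of task 204 is to group the records into fixed time intervals and estimate a conditional probability as a ratio of interval counts. The Python sketch below does this under stated assumptions: the record layout (timestamp, entity, event type), the 60-second interval, and the function names are hypothetical, and a practical learner would likely add smoothing and handle more than two variables.

```python
from collections import defaultdict


def bucket_events(records, interval):
    """Group (timestamp, entity, event_type) records into fixed-size intervals."""
    buckets = defaultdict(set)
    for timestamp, entity, event_type in records:
        buckets[timestamp // interval].add((entity, event_type))
    return list(buckets.values())


def conditional_probability(buckets, given, target):
    """Estimate P(target | given) as a ratio of interval counts.

    Returns None when `given` was never observed.
    """
    with_given = [b for b in buckets if given in b]
    if not with_given:
        return None
    return sum(1 for b in with_given if target in b) / len(with_given)


# Hypothetical fault records: (timestamp in seconds, entity, event type).
records = [
    (5, "A", "outage"), (7, "C", "outage"),
    (65, "A", "outage"), (70, "C", "outage"),
    (130, "A", "outage"),
    (200, "C", "data_error"),
    (250, "A", "outage"), (255, "C", "outage"),
]
buckets = bucket_events(records, interval=60)
p_c_given_a = conditional_probability(buckets, ("A", "outage"), ("C", "outage"))
```

Here an outage at A is seen in four intervals and an outage at C accompanies it in three of them, so the estimate of P(C outage | A outage) is 0.75. Keying each observation by entity and event type keeps the frequencies categorized by event type, as the paragraph above describes.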
[0026] Based on the analysis of task 204, the Bayesian network 120 is updated (at 206). The updated Bayesian network 120 is then used (at 208) to make predictions. For example, the predictions can be as follows: if a fault event occurs at network entity A, how will that impact network entity B; if a fault event occurred at network entity D, how likely is it that this fault event was caused by a failure at network entity C.
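The second kind of prediction above, reasoning from an observed fault at network entity D back to a possible failure at network entity C, corresponds to an application of Bayes' rule. The following Python sketch works one such case for a two-node network; all probabilities are invented for illustration, and the function name is hypothetical.

```python
def posterior_cause(prior_cause, p_effect_given_cause, p_effect_given_no_cause):
    """Return P(cause | effect) for a two-node network using Bayes' rule."""
    evidence = (p_effect_given_cause * prior_cause
                + p_effect_given_no_cause * (1.0 - prior_cause))
    return p_effect_given_cause * prior_cause / evidence


# Invented numbers: C fails in 10% of intervals; D faults 80% of the time
# when C has failed and 5% of the time otherwise.
p_c_given_d = posterior_cause(prior_cause=0.10,
                              p_effect_given_cause=0.80,
                              p_effect_given_no_cause=0.05)
```

With these numbers the probability that C failed, given a fault at D, is approximately 0.64, even though C fails in only 10% of intervals. The first kind of prediction (the impact of a fault at A on B) reads the learned conditional probability P(B | A) directly.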
[0027] It is noted that the outputs of the Bayesian network 120 can also be used to discover the network infrastructure of the network environment 102. Propagation of fault events along a particular path will reveal relationships among the network entities along that path. Since the records of events contain identifiers of the network entities, this information can be leveraged to build up a representation of the network infrastructure.
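A rough sketch of this kind of infrastructure discovery is given below in Python. It assumes that observed fault propagation has already been reduced to ordered lists of entity identifiers, and it keeps only those links seen at least a minimum number of times. The entity names, the min_support threshold, and the function name are assumptions for illustration only.

```python
from collections import Counter


def infer_links(propagation_paths, min_support=2):
    """Return directed links that appear in at least `min_support` paths."""
    edge_counts = Counter()
    for path in propagation_paths:
        # Each consecutive pair in a path is one observed propagation step.
        for upstream, downstream in zip(path, path[1:]):
            edge_counts[(upstream, downstream)] += 1
    return {edge for edge, count in edge_counts.items() if count >= min_support}


# Hypothetical propagation paths reconstructed from event records.
paths = [
    ["switch-1", "server-3", "app-9"],
    ["switch-1", "server-3"],
    ["switch-1", "server-4"],
    ["server-3", "app-9"],
]
links = infer_links(paths)
```

The single observation of switch-1 followed by server-4 falls below the threshold and is discarded, which is one simple guard against treating a coincidence as a connection.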
[0028] The process of Fig. 2 can be recursively repeated to continually update the Bayesian network 120 as conditions change or as the infrastructure of the network environment 102 changes (e.g., network entities added, network entities removed, or network entities upgraded). In this manner, it is ensured that the model of the network environment 102 used is an updated representation that does not become obsolete quickly.
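The continual update can be pictured as folding each new observation interval into running counts. The Python sketch below is one possible arrangement, not the disclosed implementation; the class name, the method names, and the optional decay factor (which lets older observations fade as the infrastructure changes) are assumptions.

```python
from collections import Counter
from itertools import permutations


class IncrementalPairModel:
    """Running pairwise fault counts that can be updated interval by interval."""

    def __init__(self, decay=1.0):
        self.decay = decay       # 1.0 keeps all history; < 1.0 fades old data
        self.single = Counter()  # weight of intervals in which an entity faulted
        self.joint = Counter()   # weight of intervals in which both entities faulted

    def update(self, faulty_entities):
        """Fold one observation interval into the counts."""
        if self.decay < 1.0:
            for counts in (self.single, self.joint):
                for key in counts:
                    counts[key] *= self.decay
        entities = set(faulty_entities)
        for entity in entities:
            self.single[entity] += 1
        for given, target in permutations(entities, 2):
            self.joint[(given, target)] += 1

    def probability(self, target, given):
        """Estimate P(target | given), or None if `given` was never seen."""
        if self.single[given] == 0:
            return None
        return self.joint[(given, target)] / self.single[given]


model = IncrementalPairModel()
model.update({"A", "B"})
model.update({"A"})
before = model.probability("B", given="A")  # 1 of 2 intervals
model.update({"A", "B"})
after = model.probability("B", given="A")   # 2 of 3 intervals
```

Because only counts are stored, each update is cheap and the estimates shift as new records arrive, without reprocessing the full history.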
[0029] Fig. 3 is a flow diagram of a process according to a further embodiment. An ontology of the domain to be modeled is provided (at 302), where the domain in this case is the network environment 102. In some implementations, the ontology 122 provides a System class with a Components subclass that contains a simple diagnostic parameter that can take on one of the following three values: available, degraded and unavailable, as discussed above.
[0030] The received records of the events are mapped (at 304) to the ontology. This allows meaningful information that is relevant to learning the Bayesian network to be extracted. In cases where the received records contain unstructured data, pre-processing can be applied to perform the mapping. Alternatively, tag fields can be provided in the records that contain information relevant to the ontology.
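The mapping of task 304 might, for example, prefer a tag field when one is present and fall back to simple keyword matching on unstructured text. The Python sketch below illustrates this idea; the record keys ("entity", "state", "text"), the keyword lists, and the function name are all hypothetical, and real pre-processing of free text would likely be considerably more involved.

```python
import re

VALID_STATES = ("available", "degraded", "unavailable")

# Hypothetical keywords used only when a record carries no state tag.
KEYWORDS = {
    "unavailable": ("down", "outage", "unreachable"),
    "degraded": ("slow", "timeout", "degraded"),
}


def map_record(record):
    """Map a record to (entity, state); state is None if nothing matches.

    `record` is a dict with an 'entity' key, an optional 'state' tag, and
    optional free 'text'.
    """
    if record.get("state") in VALID_STATES:
        return record["entity"], record["state"]  # tag field takes priority
    text = record.get("text", "").lower()
    for state, words in KEYWORDS.items():
        if any(re.search(rf"\b{re.escape(word)}\b", text) for word in words):
            return record["entity"], state
    return record["entity"], None


tagged = map_record({"entity": "web-1", "state": "degraded", "text": "server down"})
untagged = map_record({"entity": "db-2", "text": "Customer reports DB is DOWN"})
unknown = map_record({"entity": "fw-5", "text": "routine maintenance note"})
```

Records that cannot be mapped return a state of None, so that they can be set aside rather than fed to the learning step with a guessed value.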
[0031] Next, the mapped records are provided (at 306) to the analysis software 114 and Bayesian network 120 to continue to learn the Bayesian network 120.
[0032] By employing techniques according to some embodiments, a relatively convenient and automated way of predicting cause and effect relationships (or spatial relationships) among fault events (or other types of events) associated with corresponding network entities of a network environment is achieved. Administrators can be quickly informed of faults such that solutions can be developed, or temporary workaround plans can be developed.
[0033] Instructions of software described above (including the analysis software 114 and Bayesian network 120 of Fig. 1) are loaded for execution on a processor (such as processor(s) 116 in Fig. 1). A processor includes microprocessors, microcontrollers, processor modules or subsystems (including one or more microprocessors or microcontrollers), or other control or computing devices.
[0034] Data and instructions (of the software) are stored in respective storage devices, which are implemented as one or more computer-readable or computer-usable storage media. The storage media include different forms of memory including semiconductor memory devices such as dynamic or static random access memories (DRAMs or SRAMs), erasable and programmable read-only memories (EPROMs), electrically erasable and programmable read-only memories (EEPROMs) and flash memories; magnetic disks such as fixed, floppy and removable disks; other magnetic media including tape; and optical media such as compact disks (CDs) or digital video disks (DVDs). Note that the instructions of the software discussed above can be provided on one computer-readable or computer-usable storage medium, or alternatively, can be provided on multiple computer-readable or computer-usable storage media distributed in a large system having possibly plural nodes. Such computer-readable or computer-usable storage medium or media is (are) considered to be part of an article (or article of manufacture). An article or article of manufacture can refer to any manufactured single component or multiple components.
[0035] In the foregoing description, numerous details are set forth to provide an understanding of the present invention. However, it will be understood by those skilled in the art that the present invention may be practiced without these details. While the invention has been disclosed with respect to a limited number of embodiments, those skilled in the art will appreciate numerous modifications and variations therefrom. It is intended that the appended claims cover such modifications and variations as fall within the true spirit and scope of the invention.

Claims

What is claimed is:
1. A method comprising:
receiving records of events associated with network entities in a network environment, wherein the network entities are selected from hardware entities, software entities, and combinations of hardware and software entities;
analyzing, by one or more processors, the records of the events to identify relationships between events associated with different ones of the network entities, wherein the records of the events identify corresponding network entities impacted by the events; and
constructing, by the one or more processors, a Bayesian network based on the analyzing, wherein the constructed Bayesian network is able to make predictions regarding relationships between events associated with the network elements.
2. The method of claim 1, further comprising using the Bayesian network to predict events associated with some of the network entities based on detecting events at others of the network entities.
3. The method of claim 1, further comprising using the Bayesian network to diagnose a source of a problem based on detected events at one or more network entities.
4. The method of claim 1, further comprising using the Bayesian network to discover an infrastructure of the network environment.
5. The method of claim 1, further comprising:
continually receiving further records of the events associated with the network entities during operation of the network environment; and
updating the Bayesian network based on the further records of the events.
6. The method of claim 1, wherein receiving the records of the events comprises receiving the records of the events representing faults associated with the network entities.
7. The method of claim 6, wherein the faults include one or more of an outage of a network entity, a defect in a network entity, or a data error produced by a network entity.
8. The method of claim 1, wherein analyzing the records of the events comprises determining a propagation path of faults in the network environment.
9. The method of claim 1, wherein analyzing the records of the events comprises:
analyzing frequencies of the events in a predefined time interval; and
categorizing the events by event type.
10. The method of claim 1, further comprising defining an ontology that defines concepts used for learning the Bayesian network.
11. The method of claim 10, wherein the ontology defines a diagnostic parameter associated with each of the network entities that has a set of predefined potential values.
12. The method of claim 11, wherein the set of predefined potential values includes a first value indicating that the corresponding network entity is operating normally, a second value indicating that the corresponding network entity has a degraded performance, and a third value indicating that the corresponding network entity is unavailable.
13. The method of claim 1, wherein receiving the records of the events comprises receiving the records of the events that have one or both of temporal and spatial ordering, and wherein constructing the Bayesian network takes into account the one or both of the temporal and spatial ordering.
14. A computer comprising:
a storage media to store records of events associated with network entities of a network environment; and
one or more processors to:
analyze the records of events to discover relationships between events, wherein the events identify corresponding network entities, wherein the network entities are selected from hardware entities, software entities, and combinations of hardware and software entities;
using the discovered relationships and the corresponding identified network entities to learn a Bayesian network, and
use the Bayesian network to predict whether an event associated with one of the network entities is related to another event associated with another one of the network entities.
15. The computer of claim 14, wherein the events are fault events indicating occurrence of faults at the corresponding network entities.
16. The computer of claim 15, wherein the analysis of the records of the events determines a propagation path of faults in the network environment.
17. The computer of claim 14, wherein the discovered relationships comprise causal relationships between events based on temporal ordering of the events.
18. The computer of claim 14, wherein the discovered relationships comprise spatial relationships between events based on spatial ordering of the events.
19. An article comprising at least one computer-readable storage medium containing instructions that upon execution cause a computer to:
receive records of fault events indicating faults associated with network entities in a network environment, wherein the network entities are selected from hardware entities, software entities, and combinations of hardware and software entities;
analyze the records of the fault events to identify relationships between fault events associated with different ones of the network entities, wherein the records of the fault events identify corresponding network entities impacted by the fault events; and
construct a Bayesian network based on the analyzing, wherein the constructed Bayesian network is able to make predictions regarding relationships between fault events associated with the network elements.
20. The article of claim 19, wherein the instructions upon execution cause the computer to further perform one or more of:
using the Bayesian network to predict fault events associated with some of the network entities based on detecting fault events at others of the network entities;
using the Bayesian network to diagnose a source of a problem based on detected fault events at one or more network entities; and
using the Bayesian network to discover an infrastructure of the network environment.
PCT/US2009/052222 2009-07-30 2009-07-30 Constructing a bayesian network based on received events associated with network entities WO2011014169A1 (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
EP09847912.4A EP2460105B1 (en) 2009-07-30 2009-07-30 Constructing a bayesian network based on received events associated with network entities
PCT/US2009/052222 WO2011014169A1 (en) 2009-07-30 2009-07-30 Constructing a bayesian network based on received events associated with network entities
CN200980160660.3A CN102640154B (en) 2009-07-30 2009-07-30 Constructing a bayesian network based on received events associated with network entities
US13/384,516 US8938406B2 (en) 2009-07-30 2009-07-30 Constructing a bayesian network based on received events associated with network entities

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2009/052222 WO2011014169A1 (en) 2009-07-30 2009-07-30 Constructing a bayesian network based on received events associated with network entities

Publications (1)

Publication Number Publication Date
WO2011014169A1 true WO2011014169A1 (en) 2011-02-03

Family

ID=43529591

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2009/052222 WO2011014169A1 (en) 2009-07-30 2009-07-30 Constructing a bayesian network based on received events associated with network entities

Country Status (4)

Country Link
US (1) US8938406B2 (en)
EP (1) EP2460105B1 (en)
CN (1) CN102640154B (en)
WO (1) WO2011014169A1 (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9424288B2 (en) 2013-03-08 2016-08-23 Oracle International Corporation Analyzing database cluster behavior by transforming discrete time series measurements
US10373065B2 (en) * 2013-03-08 2019-08-06 Oracle International Corporation Generating database cluster health alerts using machine learning
EP2846295A1 (en) 2013-09-05 2015-03-11 Siemens Aktiengesellschaft Query answering over probabilistic supply chain information
DE102013224378A1 (en) * 2013-09-18 2015-03-19 Rohde & Schwarz Gmbh & Co. Kg Automated evaluation of test protocols in the telecommunications sector
US10163420B2 (en) * 2014-10-10 2018-12-25 DimensionalMechanics, Inc. System, apparatus and methods for adaptive data transport and optimization of application execution
US9699205B2 (en) * 2015-08-31 2017-07-04 Splunk Inc. Network security system
US10728085B1 (en) * 2015-09-15 2020-07-28 Amazon Technologies, Inc. Model-based network management
US10389742B2 (en) * 2015-10-21 2019-08-20 Vmware, Inc. Security feature extraction for a network
US10831811B2 (en) 2015-12-01 2020-11-10 Oracle International Corporation Resolution of ambiguous and implicit references using contextual information
WO2017105343A1 (en) * 2015-12-18 2017-06-22 Hitachi, Ltd. Model determination devices and model determination methods
US10237294B1 (en) 2017-01-30 2019-03-19 Splunk Inc. Fingerprinting entities based on activity in an information technology environment
US10616043B2 (en) * 2017-11-27 2020-04-07 Google Llc Real-time probabilistic root cause correlation of network failures
US11176474B2 (en) * 2018-02-28 2021-11-16 International Business Machines Corporation System and method for semantics based probabilistic fault diagnosis
CN113271216B (en) * 2020-02-14 2022-05-17 华为技术有限公司 Data processing method and related equipment
CN113537757B (en) * 2021-07-13 2024-02-09 北京交通大学 Analysis method for uncertain risk of rail transit system operation

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020019870A1 (en) * 2000-06-29 2002-02-14 International Business Machines Corporation Proactive on-line diagnostics in a manageable network
US20050021485A1 (en) * 2001-06-28 2005-01-27 Microsoft Corporation Continuous time bayesian network models for predicting users' presence, activities, and component usage
US20050114739A1 (en) * 2003-11-24 2005-05-26 International Business Machines Corporation Hybrid method for event prediction and system control
US20080168020A1 (en) * 2003-07-18 2008-07-10 D Ambrosio Bruce Douglass Relational bayesian modeling for electronic commerce

Family Cites Families (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6076083A (en) * 1995-08-20 2000-06-13 Baker; Michelle Diagnostic system utilizing a Bayesian network model having link weights updated experimentally
US6807537B1 (en) * 1997-12-04 2004-10-19 Microsoft Corporation Mixtures of Bayesian networks
US6442694B1 (en) * 1998-02-27 2002-08-27 Massachusetts Institute Of Technology Fault isolation for communication networks for isolating the source of faults comprising attacks, failures, and other network propagating errors
US6535865B1 (en) 1999-07-14 2003-03-18 Hewlett Packard Company Automated diagnosis of printer systems using Bayesian networks
US6738933B2 (en) * 2001-05-09 2004-05-18 Mercury Interactive Corporation Root cause analysis of server system performance degradations
US6957202B2 (en) 2001-05-26 2005-10-18 Hewlett-Packard Development Company L.P. Model selection for decision support systems
US7426502B2 (en) 2001-06-14 2008-09-16 Hewlett-Packard Development Company, L.P. Assessing health of a subsystem or service within a networked system
US6990486B2 (en) * 2001-08-15 2006-01-24 International Business Machines Corporation Systems and methods for discovering fully dependent patterns
US20040143561A1 (en) 2002-11-14 2004-07-22 Jensen Finn Verner Method for problem solving in technical systems with redundant components and computer system for performing the method
US7062683B2 (en) * 2003-04-22 2006-06-13 Bmc Software, Inc. Two-phase root cause analysis
US7747717B2 (en) * 2003-08-14 2010-06-29 Oracle International Corporation Fast application notification in a clustered computing system
CN100456687C (en) * 2003-09-29 2009-01-28 华为技术有限公司 Network failure real-time relativity analysing method and system
ATE366011T1 (en) 2003-10-21 2007-07-15 Hewlett Packard Development Co METHOD FOR MONITORING COMPUTER SYSTEMS
US20050216585A1 (en) * 2004-03-26 2005-09-29 Tsvetelina Todorova Monitor viewer for an enterprise network monitoring system
US7536370B2 (en) * 2004-06-24 2009-05-19 Sun Microsystems, Inc. Inferential diagnosing engines for grid-based computing systems
FR2873879B1 (en) 2004-07-30 2006-10-27 Cit Alcatel COMMUNICATION NETWORK MANAGEMENT SYSTEM FOR AUTOMATICALLY REPAIRING FAULTS
WO2008000290A1 (en) * 2006-06-30 2008-01-03 Telecom Italia S.P.A. Fault location in telecommunications networks using bayesian networks
US8660018B2 (en) 2006-07-31 2014-02-25 Hewlett-Packard Development Company, L.P. Machine learning approach for estimating a network path property
EP2122996A1 (en) * 2007-03-08 2009-11-25 Telefonaktiebolaget LM Ericsson (PUBL) An arrangement and a method relating to performance monitoring
US20100070589A1 (en) * 2008-09-17 2010-03-18 At&T Intellectual Property I, L.P. Intelligently anticipating and/or prioritizing events associated with a wireless client
US8209272B2 (en) * 2009-02-27 2012-06-26 Red Hat, Inc. Dynamic computation of optimal placement for services in a distributed computing system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP2460105A4 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2544401A1 (en) * 2011-07-08 2013-01-09 Alcatel Lucent Managing faults in a communication network based on statistical analysis
WO2014040633A1 (en) * 2012-09-14 2014-03-20 Huawei Technologies Co., Ltd. Identifying fault category patterns in a communication network
US10430417B2 (en) 2016-03-10 2019-10-01 Tata Consultancy Services Limited System and method for visual bayesian data fusion
CN108320040A (en) * 2017-01-17 2018-07-24 国网重庆市电力公司 Acquisition terminal failure prediction method and system based on Bayesian network optimization algorithm
CN108320040B (en) * 2017-01-17 2021-01-26 国网重庆市电力公司 Acquisition terminal fault prediction method and system based on Bayesian network optimization algorithm

Also Published As

Publication number Publication date
CN102640154B (en) 2015-03-25
EP2460105A1 (en) 2012-06-06
EP2460105B1 (en) 2014-10-01
US20120117009A1 (en) 2012-05-10
CN102640154A (en) 2012-08-15
EP2460105A4 (en) 2013-01-23
US8938406B2 (en) 2015-01-20


Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 200980160660.3

Country of ref document: CN

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 09847912

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 13384516

Country of ref document: US

WWE Wipo information: entry into national phase

Ref document number: 2009847912

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: DE