US20200106660A1 - Event based service discovery and root cause analysis - Google Patents
- Publication number
- US20200106660A1 (U.S. application Ser. No. 16/145,553)
- Authority
- US
- United States
- Prior art keywords
- event
- components
- service domain
- events
- root cause
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0631—Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
- H04L41/065—Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis involving logical or physical relationship, e.g. grouping and hierarchies
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0706—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
- G06F11/0709—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/079—Root cause analysis, i.e. error or fault diagnosis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/54—Interprogram communication
- G06F9/542—Event management; Broadcasting; Multicasting; Notifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0604—Management of faults, events, alarms or notifications using filtering, e.g. reduction of information by using priority, element types, position or time
- H04L41/0618—Management of faults, events, alarms or notifications using filtering, e.g. reduction of information by using priority, element types, position or time based on the physical or logical position
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0631—Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
- H04L41/0645—Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis by additionally acting on or stimulating the network after receiving notifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0677—Localisation of faults
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/22—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks comprising specially adapted graphical user interfaces [GUI]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/16—Threshold monitoring
Definitions
- the disclosure generally relates to the field of information security, and more particularly to software development, installation, and management.
- Network topology describes connections between physical components of a network and may not describe relationships between software components.
- Events are generated by a variety of sources or components, including hardware and software. Events may be specified in messages that can indicate numerous activities, such as an application finishing a task or a server failure.
- FIG. 1 depicts an example network management system which performs event-based identification of service domains and root cause analysis.
- FIG. 2 depicts service domains identified based on event correlation.
- FIG. 3 depicts covariance matrices used to identify relationships between components based on event correlation.
- FIG. 4 depicts an example of using event sequence mining to perform event correlation and identify relationships between components.
- FIG. 5 depicts a flowchart with example operations for performing event-based identification of service domains and root cause analysis.
- FIG. 6 depicts an example computer system with a service domain identifier and root cause analyzer.
- a system uses event correlation to identify components belonging to a same service or service domain.
- the system correlates events by generating covariance matrices or by performing sequence mining with temporal databases in order to discover event patterns (or episodes of events) that occur sequentially in a time window.
- Components corresponding to the correlated events are identified as being part of a same service domain and can be indicated in a service domain data structure, such as a topology.
- the system utilizes the identified service domains during root cause analysis.
- the system can determine an anomalous event occurring at a lowest layer component in a service domain as a root cause or can determine an anomalous event which occurs first in an identified event sequence of a service domain as a root cause. After identifying the root cause event, the system suppresses notifications of events occurring at other components in the service domain to avoid providing superfluous notifications through network management software to an administrator.
- component as used in the description below encompasses both hardware and software resources.
- the term component may refer to a physical device such as a computer, server, router, etc.; a virtualized device such as a virtual machine or virtualized network function; or software such as an application, a process of an application, database management system, etc.
- a component may include other components.
- a server component may include a web service component which includes a web application component.
- An event is an occurrence in a system or in a component of the system at a point in time.
- An event often relates to resource consumption and/or state of a system or system component.
- an event may be that a file was added to a file system, that a number of users of an application exceeds a threshold number of users, that an amount of available memory falls below a memory amount threshold, or that a component stopped responding or failed.
- An event indication can reference or include information about the event and is communicated by an agent or probe to a component/agent/process that processes event indications.
- Example information about an event includes an event type/code, application identifier, time of the event, severity level, event identifier, event description, etc.
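The shape of such an event indication might be sketched as follows; the class and field names are illustrative assumptions, not taken from the patent's figures:

```python
from dataclasses import dataclass

# Illustrative shape of an event indication; field names are assumptions
# chosen to mirror the example information listed above (event type,
# component identifier, time, severity, description).
@dataclass
class EventIndication:
    event_id: int
    event_type: str
    component_id: str
    timestamp: str        # time of the event
    severity: str = "info"
    description: str = ""

# Event 1 from FIG. 1: processor load for virtual machine A at 95% at 1:00.
event1 = EventIndication(
    event_id=1,
    event_type="processor_load",
    component_id="vm_a_101",
    timestamp="1:00",
    severity="warning",
    description="Processor load at 95%",
)
```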
- correlating events or event correlation involves identifying events that have a connection or relationship to one another, such as a temporal connection, cause-and-effect relationship, statistical relationship, etc.
- Correlating events or event correlation as used herein refers to the identification of this existing relationship and does not include modifying events to establish a connection or relationship.
- a service domain can refer to a collection of resources or components which are utilized in providing a service, such as an application, a database, a web server, etc.
- a service domain can include a cloud storage application, a virtual machine which executes the application, a hypervisor underlying the virtual machine, a server hosting the hypervisor, and a router which connects the server to a network.
- FIG. 1 depicts an example network management system which performs event-based identification of service domains and root cause analysis.
- FIG. 1 depicts a virtual machine A 101, a virtual machine B 120, a server 121, a storage system 122, and an event collector 105 that are connected through a network 104.
- the virtual machine A 101 includes an application 102 .
- FIG. 1 also depicts a network management system 110 that includes an event correlator 107 and a root cause analyzer 109 .
- the event collector 105, the event correlator 107, and the root cause analyzer 109 are communicatively coupled to an event database 106.
- the event collector 105 receives events from components in the network 104 and stores them in the event database 106 .
- the event collector 105 may receive the events from agents of the components in the network 104 .
- the event collector 105 receives Events 1-5 and stores them in the event database 106 .
- Event 1 indicates that the processor load for the virtual machine A 101 was at 95% at time 1:00
- Event 2 indicates that the response time for the virtual machine B 120 was 500 milliseconds at time 1:01.
- Event 3 indicates that the application 102 invoked the storage system 122 five times at time 1:15, and
- Event 4 indicates that the storage system 122 had a response time of 100 milliseconds at time 1:16.
- Event 5 indicates that the server 121 had a processor load of 85% at time 1:20.
- the event indications may include additional information that is not depicted.
- Event 1 may indicate that the processor load is an average for a certain time period and may include a minimum and maximum load for the time period.
- Events 1-5 are examples of particular types of event indications that may be received by the event collector 105.
- the event collector 105 also receives and stores event indications of other types in the event database 106 that are not depicted.
- the event correlator 107 retrieves and correlates events in the event database 106 to identify components for service domains 108 .
- Event correlation refers to the identification of a relationship or statistical connection between two or more events. Events can be correlated based on a determination that a first event caused a second event, that a first series of events caused a second series of events, that two events often occur near simultaneously, etc.
- the event correlator 107 can also correlate events based on a statistical, causal, or probability analysis using a statistical correlation/covariance matrix, as described in more detail in FIG. 3 .
- the event correlator 107 can also correlate events based on sequence mining or identification of repetitive event patterns (i.e., temporally sequential series of events), as described in more detail in FIG. 4 .
- the event correlator 107 may determine that there is a correlation between the event 3 in which the application 102 invokes the storage system 122 and the event 4 which occurs a minute later and indicates a slow response time at the storage system 122 .
- the event correlator 107 can validate correlations over multiple time periods.
- the event correlator 107 may increase a correlation probability based on identifying a pattern in past events indicating that an event with a slow response time for the storage system 122 frequently occurs after events indicating invocations of the storage system 122 by the application 102 .
- a correlation between events indicates a relationship between the corresponding components.
- Event correlation can reveal component relationships which may not be apparent from network topology information, and these relationships can be identified without requiring extensive manual input by an administrator.
- the event correlator 107 uses the determined relationships to identify components which are part of a same service domain.
- the event correlator 107 indicates the components in the service domains 108, which include the example service domain 1 115.
- the service domain 1 115 includes the application 102 , the virtual machine A 101 , the storage system 122 , a hypervisor, and a router which may also be part of the network 104 .
- the event correlator 107 included these components in a same service domain based on determining a correlation between events of these components, such as the example correlation described above between events of the application 102 and the storage system 122 . Additionally, the event correlator 107 may have included the router, a physical layer component, in the service domain 1 115 based on determining that the router was utilized by logical layer components such as the application 102 . Relationships between physical and logical layer components can be identified using a reverse lookup or through a network topology provided to the event correlator 107 .
- the layer to which a component is assigned may be based on the Open Systems Interconnection model (OSI model). Layers can include multiple components, e.g. routers and switches may be on a same layer.
- the event correlator 107 can provide the service domains 108 in a graph data structure that includes nodes identifying the components and edges indicating the relationships between the components or may indicate the components in a list.
- the event correlator 107 can include data in the service domains 108 such as a type of correlation used to identify relationships, a determined probability of correlation, event attribute values, a corresponding network or service layer for each component, etc. If sequence mining was used to identify event correlations, the event correlator 107 can label the graph data structure or otherwise indicate in the service domains 108 a sequence in which events typically occur at the components. For example, the event correlator 107 may label the service domain 1 115 to indicate that an event typically first occurs at the application 102 and then an event occurs at the storage system 122 .
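A graph data structure of this kind might be sketched as follows; the node names, layer labels, and edge metadata are illustrative assumptions for the service domain 1 115:

```python
# Hypothetical sketch of a service domain as a graph data structure:
# nodes identify components, edges carry the relationship metadata the
# description mentions (type of correlation, probability, layers), and a
# sequence label records the typical order of events.
service_domain_1 = {
    "nodes": {
        "application_102": {"layer": "application"},
        "vm_a_101":        {"layer": "virtualization"},
        "hypervisor":      {"layer": "virtualization"},
        "storage_122":     {"layer": "storage"},
        "router":          {"layer": "physical"},
    },
    "edges": [
        # (source, target, metadata) — probabilities are made up for illustration
        ("application_102", "storage_122",
         {"correlation": "sequence", "probability": 0.90}),
        ("application_102", "vm_a_101",
         {"correlation": "covariance", "probability": 0.95}),
    ],
    # Label: an event typically first occurs at the application and then
    # an event occurs at the storage system.
    "event_sequence": ["application_102", "storage_122"],
}

def components_of(domain):
    """Return the sorted list of components in a service domain."""
    return sorted(domain["nodes"])
```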
- the root cause analyzer 109 performs root cause analysis using the service domains 108 and events in the event database 106 .
- the root cause analyzer 109 may monitor the event database 106 to identify one or more anomalous events occurring at the components.
- An anomalous event is an event that indicates a network occurrence or condition that deviates from a normal or expected value or outcome. For example, an event may have an attribute value that exceeds or falls below a determined threshold or required value, or an event may indicate that a component shut down or restarted prior to a scheduled time. Additionally, an anomalous event may be an event that indicates a network issue such as a component failure.
- after identifying one or more anomalous events, the root cause analyzer 109 identifies one or more service domains from the service domains 108 which include components corresponding to the anomalous events. The root cause analyzer 109 then utilizes the identified service domain(s) to aid in the root cause analysis process. For example, if an anomalous event, such as a slow response time, occurred at the application 102, the root cause analyzer 109 identifies the service domain 1 115 from the service domains 108. The root cause analyzer 109 then identifies related components in the service domain 1 115 and retrieves events for those components from the event database 106.
- the root cause analyzer 109 identifies an anomalous event occurring at a lowest layer component in the service domain 1 115 and outputs that event as a root cause event 111 . For example, if a high processor load event was occurring at the hypervisor, which is a lower layer component than the application 102 , the root cause analyzer 109 prioritizes the high processor load event as the root cause and outputs that event as the root cause event 111 . In another implementation, the root cause analyzer 109 may utilize an event sequence or pattern indicated in the service domain 1 115 to identify which component typically starts the series of events resulting in an anomaly.
- the root cause analyzer 109 outputs an event at the application 102 as the root cause event 111 .
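The lowest-layer heuristic described above might be sketched as follows; the numeric layer ranks are assumptions for illustration (a lower number means a lower layer, so a hypervisor sits below the virtual machines and applications it supports):

```python
# Sketch of the lowest-layer heuristic: among the anomalous events in a
# service domain, prioritize the one at the lowest layer component as the
# root cause. Layer ranks are illustrative, not from the patent.
LAYER_RANK = {"router": 0, "server": 1, "hypervisor": 2,
              "virtual_machine": 3, "application": 4}

def root_cause(anomalous_events):
    """anomalous_events: list of (component, component_type, event) tuples."""
    return min(anomalous_events, key=lambda e: LAYER_RANK[e[1]])

events = [
    ("application_102", "application", "slow response time"),
    ("hypervisor_1", "hypervisor", "high processor load"),
]
# The hypervisor event is prioritized because it is at a lower layer.
cause = root_cause(events)
```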
- the root cause analyzer 109 may also output related events 112 which occur at other components in the service domain 1 115 ; however, as indicated by the dashed lines in FIG. 1 , the related events 112 may be hidden or suppressed so that an administrator is not overwhelmed with alarms or notifications of anomalous events or other possible root causes.
- the root cause analyzer 109 suppresses events generated by the components in the service domain 1 115 while an issue causing the anomalous events is still occurring.
- the root cause analyzer 109 can suppress events by filtering events using identifiers for the components in the service domain 1 115 and preventing the filtered events from being sent for display. Once the issue has been resolved and the components in the service domain 1 115 are functioning properly, the root cause analyzer 109 resumes normal generation of event notifications.
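Filtering by component identifier might look like the following sketch; the function and event field names are hypothetical:

```python
# Sketch of notification suppression: once a root cause event is chosen,
# events from the other components in the same service domain are filtered
# out so they are not sent for display, while events from components
# outside the domain pass through unchanged.
def filter_notifications(events, domain_components, root_cause_event):
    """Keep the root cause event plus events from components outside the domain."""
    visible = []
    for event in events:
        if event is root_cause_event or event["component"] not in domain_components:
            visible.append(event)
        # else: suppressed while the underlying issue is unresolved
    return visible

domain = {"application_102", "vm_a_101", "storage_122"}
incoming = [
    {"component": "storage_122", "type": "high load"},          # root cause
    {"component": "application_102", "type": "slow response"},  # suppressed
    {"component": "router_9", "type": "link flap"},             # outside domain
]
shown = filter_notifications(incoming, domain, incoming[0])
```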
- FIG. 2 depicts service domains identified based on event correlation.
- FIG. 2 shows a service domain 1 201 and a service domain 2 202 which each comprise a subset of components executing within different layers of a network 203 .
- the components in the service domains may have been determined to be related based on event correlation through sequence mining or through a covariance matrix.
- the components are related in that they function together to provide sessions for an IP telepresence service: the service domain 1 201 includes a session 1 and the service domain 2 202 includes a session 2.
- IP-multicast groups, IP-quality of service (QoS) classes, layer-3 (L3) network paths, a border gateway protocol (BGP), multiprotocol label switching paths (MPLS LSP), a virtual local area network (VLAN), and a router.
- root cause analysis of the session can be simplified by limiting the analysis to the components in the service domain 1 201 .
- other information about the service domain 1 201 may be utilized to identify a root cause.
- root causes are inferred based on a lowest layer component in the service domain 1 201 which is experiencing an issue. For example, if the “Group 1” IP-multicast is experiencing an issue and the router is experiencing an issue, it is determined that the router issue is the root cause of problems for the service domain 1 201 , as the router is at a lower layer than the IP-multicast.
- a component's layer can be determined based on a component type, an assigned or logical OSI layer, etc.
- a component's layer can be determined relative to other components. For example, a virtual machine is considered a higher layer than the hypervisor on which it executes. Similarly, a server which executes the hypervisor is at a higher layer than a router which it uses for transmitting network traffic.
- alarms or notifications for other components in the service domain 1 201 can be suppressed, e.g., not displayed to a user.
- “Session 2” of the service domain 2 202 is also experiencing issues, alarms or notifications for other components in the service domain 2 202 can also be suppressed and, ultimately, only a single event or notification identifying the root cause is presented, thereby avoiding overloading an interface of network management software with notifications.
- Events for components of the service domains may be suppressed until the issue is resolved and then event notification may continue as normal.
- FIG. 3 depicts covariance matrices used to identify relationships between components based on event correlation.
- FIG. 3 depicts an event correlator 307 which produces a covariance matrix 301 at a stage A and a set of covariance matrices 302 over multiple time periods during a stage B.
- the columns and rows of the matrices identify components in a network, such as the components depicted in FIG. 2 .
- the entries in the matrices represent the correlation or covariance of events between the components.
- covariance is a measure of the joint variability of two random variables. In the case of FIG. 3 , the random variables are events occurring at the components.
- the covariance analysis of the events generates a number between 0 and 1 indicating the probability that there is a correlation between the events, 1 being a high probability and 0 being a low probability.
- a threshold can be set to determine whether the probability of correlation is high enough to confidently determine that two components are related and belong to a same service domain.
- the threshold is 85%, and the entries in the matrices which satisfy the threshold are bolded and underlined.
- the event correlator 307 may determine that the components are part of a same service domain.
- the event correlator 307 generates a first matrix, the matrix 301 , based on event correlation.
- the event correlator 307 may use events from an event log from a first or most recent time period to generate the matrix 301 .
- the event correlator 307 may use events generated in the previous 10 minutes or events from a first 30 minutes of operation of the components. Since the matrix 301 is based on correlation from just a single time period, the matrix 301 is treated as a hypothesis and is tested/validated as additional events are received and analyzed.
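A single-period stage A pass might be sketched as follows; the per-component event-count series are synthetic, and computing a normalized Pearson correlation per component pair stands in for the covariance analysis described above:

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length event-count series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Synthetic event counts per time bucket within one time period.
components = {
    "port":       [2, 5, 1, 7, 3],
    "multicast":  [2, 5, 1, 7, 3],   # tracks the port closely
    "video_conf": [9, 0, 4, 1, 6],
}

# Pairs whose correlation satisfies the 85% threshold are hypothesized to
# belong to a same service domain, pending validation over more periods.
threshold = 0.85
related = [
    (a, b) for a, b in combinations(components, 2)
    if pearson(components[a], components[b]) >= threshold
]
```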
- the event correlator 307 continues collecting and analyzing events over multiple time periods to generate the set of covariance matrices 302 .
- the statistical power of the correlations increases, thereby decreasing the risk of making a Type II error.
- a Type II error refers to the failure to reject a false hypothesis; the hypothesis in this instance being that there is a correlation between events of two components as shown in the matrix 301.
- the event correlator 307 may continue collecting and analyzing events over multiple time periods until the probability of making a Type II error falls below a threshold or, stated differently, until the statistical power has exceeded a threshold.
- the consistency with which a correlation is identified over the multiple time periods indicates the confidence which can be placed in the identified correlation. For example, if the event correlator 307 generates three matrices over three time periods and a threshold-satisfying correlation appears in all three, then the event correlator 307 can have high confidence in the correlation and the correlation likely has high statistical power.
- the event correlator 307 can output a matrix based on an aggregation of the set of covariance matrices 302 or a list of related components identified based on the correlation.
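Validation across time periods might be sketched as follows; representing each period's matrix as a dictionary of pair-to-value entries is an illustrative simplification, as are the values themselves:

```python
# Sketch of stage B: accept a correlation only if it satisfies the
# threshold consistently across several time periods, which increases
# statistical power and reduces the chance of a Type II error.
def consistent_pairs(period_matrices, threshold=0.85, min_periods=3):
    """Keep component pairs whose correlation clears the threshold in at
    least min_periods of the analyzed time periods."""
    counts = {}
    for matrix in period_matrices:       # one dict of pair -> value per period
        for pair, value in matrix.items():
            if value >= threshold:
                counts[pair] = counts.get(pair, 0) + 1
    return {pair for pair, c in counts.items() if c >= min_periods}

periods = [
    {("port", "multicast"): 0.91, ("port", "video_conf"): 0.88},
    {("port", "multicast"): 0.89, ("port", "video_conf"): 0.40},
    {("port", "multicast"): 0.93, ("port", "video_conf"): 0.87},
]
# Only the pair that clears the threshold in all three periods survives.
accepted = consistent_pairs(periods)
```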
- FIG. 4 depicts an example of using event sequence mining to perform event correlation and identify relationships between components.
- FIG. 4 depicts an event key 401 , event log 402 , and a mined pattern 403 generated by an event correlator 407 .
- the event key 401 identifies known entities or components in a system along with their component identifiers. Additionally, the event key 401 identifies types of events for those components along with event identifiers.
- the event log 402 shows a number of events stored in an event database 406 which have been sorted according to an associated timestamp. Each event indication in the event log 402 indicates a component identifier for the corresponding component and an event identifier for the type of event which occurred.
- the first event indication in the event log 402 occurred at a component with identifier “1”, which, as shown in the event key 401 , is the component “port.” Additionally, the first event indication has an event identifier of “16” which corresponds to the event type of “port down.”
- the event correlator 407 uses sequence mining on the temporal listing of events in the event log 402 in order to discover patterns (or episodes of events) that occur sequentially in a fixed time window. This approach allows discovery of patterns which occur repeatedly with a high confidence index, thereby making the mined pattern causal and not coincidental.
- the event correlator 407 may mine the data using Apriori-style algorithms to identify sequences. If an event pattern or episode is recognized within a specified timeframe with a high confidence index on causality (based on factors like the number of repetitions, probabilistic distribution, etc.), then that episode is a set of events that occur one after the other and are correlated. Components associated with that set of events are then indicated as being part of a same service domain.
- the event correlator 407 mines the event log 402 using a mining algorithm to identify patterns or sequences of events.
- the sequence mining algorithm may be the PrefixSpan algorithm.
- the mined pattern 403 is identified.
- the mined pattern 403 is the longest subsequence which is repeated in the event log 402 .
- the mined pattern 403 may be further processed to determine a confidence index of causality between the events in the mined pattern 403 . If there is a high confidence of causality, it is determined that the events and their corresponding components are related and are part of a same service domain.
- the event correlator 407 may perform the sequence mining over multiple time periods to improve the statistical power of the correlations prior to making a determination that the events in the mined pattern 403 , and their corresponding components, are related.
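A much-simplified stand-in for the mining step is finding the longest contiguous run of (component, event) pairs that repeats in the log; real implementations would use PrefixSpan or Apriori-style episode mining over a sliding time window, so the brute-force version below only illustrates the idea:

```python
# Simplified stand-in for sequence mining: return the longest contiguous
# run of (component_id, event_id) pairs that occurs at least twice in the
# log. O(n^2) brute force, for illustration only.
def longest_repeated_run(log):
    best = []
    n = len(log)
    for i in range(n):
        for j in range(i + 1, n):
            length = 0
            while j + length < n and log[i + length] == log[j + length]:
                length += 1
            if length > len(best):
                best = log[i:i + length]
    return best

# (component_id, event_id) pairs, e.g. (1, 16) could mean "port down" at
# the "port" component per the event key 401; values here are made up.
log = [(1, 16), (2, 3), (3, 7), (9, 9),
       (1, 16), (2, 3), (3, 7), (5, 2)]
pattern = longest_repeated_run(log)
```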
- the mined pattern 403 is just one example pattern for a service domain.
- a service domain can include multiple event patterns or sequences involving one or more of the same components and event types. For example, instead of a sequence beginning with the port down event, a sequence may begin with a QoS violation event which causes events at the multicast and video conference components, such as slow response times.
- events generated in a service domain can be compared to the one or more event patterns associated with the service domain to determine which pattern is occurring. The identified pattern is then used to determine a root cause of an issue.
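Matching live events against a service domain's known patterns might be sketched as follows; the pattern names and event labels are hypothetical, and treating a pattern as a prefix of the observed sequence is a simplifying assumption:

```python
# Sketch of pattern matching for root cause analysis: compare observed
# events to the event patterns associated with a service domain; the first
# event of the matched pattern points at the root cause.
patterns = {
    "port_down_cascade": ["port down", "multicast degraded", "conference stalled"],
    "qos_cascade":       ["QoS violation", "multicast slow", "conference slow"],
}

def match_pattern(observed, patterns):
    """Return the name of the first known pattern that is a prefix of the
    observed event sequence, or None if nothing matches."""
    for name, pattern in patterns.items():
        if observed[:len(pattern)] == pattern:
            return name
    return None

observed = ["QoS violation", "multicast slow", "conference slow", "users dropped"]
hit = match_pattern(observed, patterns)
root_cause_event = patterns[hit][0]   # first event in the matched pattern
```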
- FIG. 5 depicts a flowchart with example operations for performing event-based identification of service domains and root cause analysis. The operations of FIG. 5 are described as being performed by a network management system for consistency with FIG. 1 , although naming of program code can vary among implementations.
- a network management system retrieves events from an event log for analysis ( 502 ).
- the system may query an event database to retrieve events or may subscribe to an event management service which forwards batches of events to the system.
- the system may sort the events into a chronological order, filter for events of a particular type, or otherwise prepare the collection of events for analysis.
- the system begins operations for multiple time periods represented by the events ( 504 ).
- the system may divide or split the events into time periods for processing. For example, the system may split the events into collections of five-minute periods. Alternatively, in some implementations, the system may divide the events into sets of a number of events, e.g., 100 events per set.
- the time period or collection of events currently being processed is hereinafter referred to as “events for the selected time period.”
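The split into time periods might be sketched as follows; timestamps as minutes-since-start and the five-minute bucket size are illustrative:

```python
# Sketch of block 504: divide a sorted event stream into fixed five-minute
# time periods for per-period correlation analysis. (An alternative, per
# the description, is fixed-size batches such as 100 events per set.)
def split_into_periods(events, period_minutes=5):
    """events: list of (timestamp_minutes, event) sorted by timestamp."""
    periods = {}
    for ts, event in events:
        bucket = ts // period_minutes           # index of the period
        periods.setdefault(bucket, []).append(event)
    return [periods[k] for k in sorted(periods)]

events = [(0, "a"), (1, "b"), (4, "c"), (6, "d"), (12, "e")]
batches = split_into_periods(events)
```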
- the system identifies correlations of events for the selected time period ( 506 ).
- the system analyzes the events and may generate a covariance matrix for components represented in the events or perform sequence mining on the events.
- the system may compare/combine correlations based on the events from the selected time period to correlations generated based on events from previous time periods.
- the system can then generate a cumulative set of event correlations based on the analysis performed across the different time periods.
- the system determines whether any event correlations satisfy a statistical threshold ( 508 ).
- the system may compare values representing a probability of the statistical correlations to one or more thresholds to determine whether any of the correlations have a satisfactory statistical power or confidence. Additionally, as described above, the system may determine whether the probability of making a Type II error has been sufficiently reduced for one or more of the event correlations. For event sequences, the system can determine whether the event sequence has occurred a threshold number of times or a sufficient number of times to satisfy a statistical probability that the sequence is not a random occurrence and represents correlated events.
- the system waits for an additional time period of events ( 510 ). If analyzing events from a log, the system may select a collection of events from a next time period. Alternatively, the system waits until a subsequent time period has elapsed and retrieves events for that time period or waits until another batch of events is received from an event management system. The system then continues operations at block 504 .
- If there are correlations which satisfy the threshold, the system generates service domains based on the threshold-satisfying event correlations ( 512 ).
- the system identifies components corresponding to the event correlations and generates a service domain comprising the components.
- the service domain may be a topology, graph data structure, or a listing which identifies the components as belonging to a same service domain.
- the system may include information in the service domain data structure such as identified event sequences, service or network layers associated with each of the components, statistical strength of event correlations, etc.
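One possible shape for such a service domain data structure is sketched below in Python. The field names, components, and values are hypothetical; the disclosure only requires that components, layers, sequences, and correlation strengths be representable:

```python
# Hypothetical service-domain records; lower layer numbers indicate
# lower-layer components, which matters later for root cause inference.
service_domains = [
    {
        "name": "service_domain_1",
        "components": {"application", "vm_a", "storage", "hypervisor", "router"},
        "layer": {"router": 1, "hypervisor": 2, "vm_a": 3, "storage": 3, "application": 4},
        "event_sequence": [("application", "invoke_storage"), ("storage", "slow_response")],
        "correlation_strength": {("application", "storage"): 0.90},
    },
    {
        "name": "service_domain_2",
        "components": {"vm_b", "server", "router"},
        "layer": {"router": 1, "server": 2, "vm_b": 3},
    },
]

def domains_containing(domains, component_id):
    """Return every service domain that includes the given component."""
    return [d for d in domains if component_id in d["components"]]
```

A lookup like `domains_containing` mirrors the later step of searching the generated service domains with a component identifier; a graph library could replace the flat dicts if edge data is needed.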
- After generating at least a first service domain based on the event correlations, the system is prepared to begin root cause analysis utilizing the generated service domain, as represented by the operations at blocks 514, 516, 518, and 520.
- the system also continues refining and validating the event correlations and the generated service domains. For example, the system may add or remove components in the service domains based on additional event correlation. As a result, the system also returns to block 510 to continue performing event correlation in parallel with the root cause analysis operations.
- the system detects an occurrence of an anomalous event ( 514 ).
- Block 514 is depicted with a dashed outline to represent that the system continually monitors for the occurrence of anomalous events as a background operation and that the operations of blocks 516 , 518 , and 520 may be triggered each time an anomalous event is detected.
- the system may detect an anomalous event by identifying an event which indicates that a component issue or failure has occurred or by comparing metrics in an event to pre-established performance thresholds.
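A minimal sketch of this detection logic follows; the metric names and threshold values are hypothetical:

```python
# Hypothetical pre-established performance thresholds keyed by metric name.
THRESHOLDS = {"cpu_load": 0.90, "response_time_ms": 400}

def is_anomalous(event):
    """An event is anomalous if it reports a component failure outright,
    or if its metric value exceeds the metric's threshold."""
    if event.get("severity") == "failure":
        return True
    metric, value = event.get("metric"), event.get("value")
    return metric in THRESHOLDS and value > THRESHOLDS[metric]
```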
- the system selects at least a first service domain related to the anomalous event ( 516 ).
- the system determines the component corresponding to the event based on a component identifier or other indicator in the event associated with the component.
- the system searches the generated service domains with the component identifier to retrieve one or more service domains which include the component.
- the system selects at least a first service domain for which to perform root cause analysis but can also perform root cause analysis for all affected service domains in parallel, as the service domains are likely all experiencing a same root cause since the service domains share the anomalous component.
- the system retrieves events related to components in the service domain ( 518 ).
- the system identifies all components in the service domain and then queries an event database to retrieve recent events for the components.
- the system may structure the query to retrieve only anomalous events for the components.
- the system identifies a root cause of the anomalous event based on the events within the service domain ( 520 ).
- the system can perform root cause analysis by identifying a lowest layer component in the service domain which is experiencing an anomalous event and identifying that component as the root cause.
- the system may analyze an event sequence indicated in the service domain to identify the earliest event in the sequence which matches an event of one of the components. For example, if the sequence begins with an event of type 3 at a component A, the system determines whether an event of type 3 recently occurred at the component A. The system may continue through the sequence to determine whether there is a matching recent event for each event in the sequence.
- the system determines that the event at the component corresponding to the first event in the sequence is the root cause.
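Both root cause strategies described above — the lowest-layer anomalous component, and the earliest matching event in a known sequence — can be sketched as follows (component names, event types, and layer numbers are illustrative):

```python
def root_cause_by_layer(layer_of, anomalous_components):
    """Treat the lowest-layer component experiencing an anomalous event
    as the root cause; lower numbers mean lower layers."""
    return min(anomalous_components, key=lambda c: layer_of[c])

def root_cause_by_sequence(event_sequence, recent_events):
    """Walk the domain's known event sequence in order and return the
    earliest step that matches a recent event; per the approach above,
    that event is treated as the root cause."""
    recent = {(e["component"], e["type"]) for e in recent_events}
    for step in event_sequence:
        if step in recent:
            return step
    return None  # no recent event matches the known sequence
```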
- the system outputs the root cause event identified as a result of the analysis and suppresses other alarms or events for the service domain.
- the system may also perform automated remedial actions to correct the issue. For example, if the root cause event was a router issue, the system may remotely reboot the router or invoke a script for resetting a port on the router. After identifying the root cause, the system returns to block 514 until another anomalous event is detected.
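A sketch of the reporting, suppression, and remediation step follows; the mapping of event types to remedial actions is hypothetical and not taken from the disclosure:

```python
# Hypothetical mapping from root-cause event types to automated actions.
REMEDIATIONS = {"router_down": "remote_reboot", "port_down": "reset_port_script"}

def handle_root_cause(root_cause, domain_events):
    """Report the root-cause event, suppress the domain's other events,
    and look up an automated remedial action if one is registered."""
    suppressed = [e for e in domain_events if e is not root_cause]
    return {
        "notify": root_cause,
        "suppressed": suppressed,
        "action": REMEDIATIONS.get(root_cause["type"]),
    }
```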
- FIGS. 1 and 3 are annotated with a series of letters. These letters represent stages of operations. Although these stages are ordered for this example, the stages illustrate one example to aid in understanding this disclosure and should not be used to limit the claims. Subject matter falling within the scope of the claims can vary with respect to the order and some of the operations.
- aspects of the disclosure may be embodied as a system, method or program code/instructions stored in one or more machine-readable media. Accordingly, aspects may take the form of hardware, software (including firmware, resident software, micro-code, etc.), or a combination of software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.”
- the functionality presented as individual modules/units in the example illustrations can be organized differently in accordance with any one of platform (operating system and/or hardware), application ecosystem, interfaces, programmer preferences, programming language, administrator preferences, etc.
- the machine readable medium may be a machine readable signal medium or a machine readable storage medium.
- a machine readable storage medium may be, for example, but not limited to, a system, apparatus, or device, that employs any one of or combination of electronic, magnetic, optical, electromagnetic, infrared, or semiconductor technology to store program code.
- More specific examples (a non-exhaustive list) of the machine readable storage medium would include the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
- a machine readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
- a machine readable storage medium is not a machine readable signal medium.
- a machine readable signal medium may include a propagated data signal with machine readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof.
- a machine readable signal medium may be any machine readable medium that is not a machine readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
- Program code embodied on a machine readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
- Computer program code for carrying out operations for aspects of the disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as the Java® programming language, C++ or the like; a dynamic programming language such as Python; a scripting language such as Perl programming language or PowerShell script language; and conventional procedural programming languages, such as the “C” programming language or similar programming languages.
- the program code may execute entirely on a stand-alone machine, may execute in a distributed manner across multiple machines, and may execute on one machine while providing results and/or accepting input on another machine.
- the program code/instructions may also be stored in a machine readable medium that can direct a machine to function in a particular manner, such that the instructions stored in the machine readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
- FIG. 6 depicts an example computer system with a service domain identifier and root cause analyzer.
- the computer system includes a processor unit 601 (possibly including multiple processors, multiple cores, multiple nodes, and/or implementing multi-threading, etc.).
- the computer system includes memory 607 .
- the memory 607 may be system memory (e.g., one or more of cache, SRAM, DRAM, zero capacitor RAM, Twin Transistor RAM, eDRAM, EDO RAM, DDR RAM, EEPROM, NRAM, RRAM, SONOS, PRAM, etc.) or any one or more of the above already described possible realizations of machine-readable media.
- the computer system also includes a bus 603 (e.g., PCI, ISA, PCI-Express, HyperTransport® bus, InfiniBand® bus, NuBus, etc.) and a network interface 605 (e.g., a Fiber Channel interface, an Ethernet interface, an internet small computer system interface, SONET interface, wireless interface, etc.).
- the system also includes a service domain identifier and root cause analyzer 611 .
- the service domain identifier and root cause analyzer 611 performs event-based identification of service domains and utilizes knowledge of the service domains in performing root cause analysis. Any one of the previously described functionalities may be partially (or entirely) implemented in hardware and/or on the processor unit 601 .
- the functionality may be implemented with an application specific integrated circuit, in logic implemented in the processor unit 601 , in a co-processor on a peripheral device or card, etc. Further, realizations may include fewer or additional components not illustrated in FIG. 6 (e.g., video cards, audio cards, additional network interfaces, peripheral devices, etc.).
- the processor unit 601 and the network interface 605 are coupled to the bus 603 . Although illustrated as being coupled to the bus 603 , the memory 607 may be coupled to the processor unit 601 .
Abstract
Description
- The disclosure generally relates to the field of information security, and more particularly to software development, installation, and management.
- Information related to interconnections among components in a system is often used for root cause analysis of system issues. For example, a network administrator or network management software may utilize network topology and network events to aid in troubleshooting issues and outages. Network topology describes connections between physical components of a network and may not describe relationships between software components. Events are generated by a variety of sources or components, including hardware and software. Events may be specified in messages that can indicate numerous activities, such as an application finishing a task or a server failure.
- Aspects of the disclosure may be better understood by referencing the accompanying drawings.
- FIG. 1 depicts an example network management system which performs event-based identification of service domains and root cause analysis.
- FIG. 2 depicts service domains identified based on event correlation.
- FIG. 3 depicts covariance matrices used to identify relationships between components based on event correlation.
- FIG. 4 depicts an example of using event sequence mining to perform event correlation and identify relationships between components.
- FIG. 5 depicts a flowchart with example operations for performing event-based identification of service domains and root cause analysis.
- FIG. 6 depicts an example computer system with a service domain identifier and root cause analyzer.
- The description that follows includes example systems, methods, techniques, and program flows that embody aspects of the disclosure. However, it is understood that this disclosure may be practiced without these specific details or be practiced in other environments. For instance, this disclosure refers to performing root cause analysis using identified service domains in illustrative examples. Aspects of this disclosure can also be applied to using identified service domains for determining single points of failure in a system or identifying other weaknesses, such as load balancing issues, for a system. In other instances, well-known instruction instances, protocols, structures and techniques have not been shown in detail in order not to obfuscate the description.
- Without knowledge of a system, detecting relationships across domains for components of a service can be difficult in the absence of topology information. While some techniques for cross domain correlation of components are available, these require significant effort to pre-load or pre-define the correlation scope. Additionally, some forms of cross domain correlation are very limited in which domains they can operate and may still require some front-end design or hard coding by experts to function properly. These hard-coded correlation techniques fail when deployed to new domains (or new network configurations or network technologies).
- To provide improved identification of service components and root cause analysis, a system uses event correlation to identify components belonging to a same service or service domain. The system correlates events by generating covariance matrices or by performing sequence mining with temporal databases in order to discover event patterns (or episodes of events) that occur sequentially in a time window. Components corresponding to the correlated events are identified as being part of a same service domain and can be indicated in a service domain data structure, such as a topology. The system utilizes the identified service domains during root cause analysis. The system can determine an anomalous event occurring at a lowest layer component in a service domain as a root cause or can determine an anomalous event which occurs first in an identified event sequence of a service domain as a root cause. After identifying the root cause event, the system suppresses notifications of events occurring at other components in the service domain to avoid providing superfluous notifications through network management software to an administrator.
- The term “component” as used in the description below encompasses both hardware and software resources. The term component may refer to a physical device such as a computer, server, router, etc.; a virtualized device such as a virtual machine or virtualized network function; or software such as an application, a process of an application, database management system, etc. A component may include other components. For example, a server component may include a web service component which includes a web application component.
- The description below refers to an “event” to describe a message, indication, or notification of an event. An event is an occurrence in a system or in a component of the system at a point in time. An event often relates to resource consumption and/or state of a system or system component. As examples, an event may be that a file was added to a file system, that a number of users of an application exceeds a threshold number of users, that an amount of available memory falls below a memory amount threshold, or that a component stopped responding or failed. An event indication can reference or include information about the event and is communicated by an agent or probe to a component/agent/process that processes event indications. Example information about an event includes an event type/code, application identifier, time of the event, severity level, event identifier, event description, etc.
- The description below refers to correlating events or event correlation. The process of event correlation involves identifying events that have a connection or relationship to one another, such as a temporal connection, cause-and-effect relationship, statistical relationship, etc. Correlating events or event correlation as used herein refers to the identification of this existing relationship and does not include modifying events to establish a connection or relationship.
- The description below uses the term “service domain” to refer to a collection of resources or components which are utilized in providing a service, such as an application, a database, a web server, etc. For example, a service domain can include a cloud storage application, a virtual machine which executes the application, a hypervisor underlying the virtual machine, a server hosting the hypervisor, and a router which connects the server to a network.
- Example Illustrations
- FIG. 1 depicts an example network management system which performs event-based identification of service domains and root cause analysis. FIG. 1 depicts a virtual machine A 101, a virtual machine B 120, a server 121, a storage system 122, and an event collector 105 that are connected through a network 104. The virtual machine A 101 includes an application 102. FIG. 1 also depicts a network management system 110 that includes an event correlator 107 and a root cause analyzer 109. The event collector 105, the event correlator 107, and the root cause analyzer 109 are communicatively coupled to an event database 106.
- At stage A, the event collector 105 receives events from components in the network 104 and stores them in the event database 106. The event collector 105 may receive the events from agents of the components in the network 104. In FIG. 1, the event collector 105 receives Events 1-5 and stores them in the event database 106. Event 1 indicates that the processor load for the virtual machine A 101 was at 95% at time 1:00, and Event 2 indicates that the response time for the virtual machine B 120 was 500 milliseconds at time 1:01. Event 3 indicates that the application 102 invoked the storage system 122 five times at time 1:15, and Event 4 indicates that the storage system 122 had a response time of 100 milliseconds at time 1:16. Event 5 indicates that the server 121 had a processor load of 85% at time 1:20. The event indications may include additional information that is not depicted. For example, Event 1 may indicate that the processor load is an average for a certain time period and may include a minimum and maximum load for the time period. Events 1-5 are examples of particular types of event indicators that may be received by the event collector 105. The event collector 105 also receives and stores event indications of other types in the event database 106 that are not depicted.
- At stage B, the event correlator 107 retrieves and correlates events in the event database 106 to identify components for service domains 108. Event correlation refers to the identification of a relationship or statistical connection between two or more events. Events can be correlated based on a determination that a first event caused a second event, that a first series of events caused a second series of events, that two events often occur near simultaneously, etc. The event correlator 107 can also correlate events based on a statistical, causal, or probability analysis using a statistical correlation/covariance matrix, as described in more detail in FIG. 3. The event correlator 107 can also correlate events based on sequence mining or identification of repetitive event patterns (i.e., temporally sequential series of events), as described in more detail in FIG. 4. In FIG. 1, for example, the event correlator 107 may determine that there is a correlation between the Event 3 in which the application 102 invokes the storage system 122 and the Event 4 which occurs a minute later and indicates a slow response time at the storage system 122. The event correlator 107 can validate correlations over multiple time periods. For example, the event correlator 107 may increase a correlation probability based on identifying a pattern in past events indicating that an event with a slow response time for the storage system 122 frequently occurs after events indicating invocations of the storage system 122 by the application 102.
- A correlation between events indicates a relationship between the corresponding components. Event correlation can reveal component relationships which may not be apparent from network topology information, and these relationships can be identified without requiring extensive manual input by an administrator. The event correlator 107 uses the determined relationships to identify components which are part of a same service domain. The event correlator 107 indicates the components in the service domains 108, which include the example service domain 1 115. As shown in FIG. 1, the service domain 1 115 includes the application 102, the virtual machine A 101, the storage system 122, a hypervisor, and a router which may also be part of the network 104. The event correlator 107 included these components in a same service domain based on determining a correlation between events of these components, such as the example correlation described above between events of the application 102 and the storage system 122. Additionally, the event correlator 107 may have included the router, a physical layer component, in the service domain 1 115 based on determining that the router was utilized by logical layer components such as the application 102. Relationships between physical and logical layer components can be identified using a reverse lookup or through a network topology provided to the event correlator 107. The layer to which a component is assigned may be based on the Open Systems Interconnection model (OSI model). Layers can include multiple components, e.g., routers and switches may be on a same layer. The event correlator 107 can provide the service domains 108 in a graph data structure that includes nodes identifying the components and edges indicating the relationships between the components or may indicate the components in a list. The event correlator 107 can include data in the service domains 108 such as a type of correlation used to identify relationships, a determined probability of correlation, event attribute values, a corresponding network or service layer for each component, etc. If sequence mining was used to identify event correlations, the event correlator 107 can label the graph data structure or otherwise indicate in the service domains 108 a sequence in which events typically occur at the components. For example, the event correlator 107 may label the service domain 1 115 to indicate that an event typically first occurs at the application 102 and then an event occurs at the storage system 122.
- At stage C, the root cause analyzer 109 performs root cause analysis using the service domains 108 and events in the event database 106. The root cause analyzer 109 may monitor the event database 106 to identify one or more anomalous events occurring at the components. An anomalous event is an event that indicates a network occurrence or condition that deviates from a normal or expected value or outcome. For example, an event may have an attribute value that exceeds or falls below a determined threshold or required value, or an event may indicate that a component shut down or restarted prior to a scheduled time. Additionally, an anomalous event may be an event that indicates a network issue such as a component failure.
- After identifying one or more anomalous events, the root cause analyzer 109 identifies one or more service domains from the service domains 108 which include components corresponding to the anomalous events. The root cause analyzer 109 then utilizes the identified service domain(s) to aid in the root cause analysis process. For example, if an anomalous event, such as a slow response time, occurred at the application 102, the root cause analyzer 109 identifies the service domain 1 115 from the service domains 108. The root cause analyzer 109 then identifies related components in the service domain 1 115 and retrieves events for those components from the event database 106. In one implementation, the root cause analyzer 109 identifies an anomalous event occurring at a lowest layer component in the service domain 1 115 and outputs that event as a root cause event 111. For example, if a high processor load event was occurring at the hypervisor, which is a lower layer component than the application 102, the root cause analyzer 109 prioritizes the high processor load event as the root cause and outputs that event as the root cause event 111. In another implementation, the root cause analyzer 109 may utilize an event sequence or pattern indicated in the service domain 1 115 to identify which component typically starts the series of events resulting in an anomaly. If the event sequence is typically instigated by the application 102, the root cause analyzer 109 outputs an event at the application 102 as the root cause event 111. The root cause analyzer 109 may also output related events 112 which occur at other components in the service domain 1 115; however, as indicated by the dashed lines in FIG. 1, the related events 112 may be hidden or suppressed so that an administrator is not overwhelmed with alarms or notifications of anomalous events or other possible root causes. In general, the root cause analyzer 109 suppresses events generated by the components in the service domain 1 115 while an issue causing the anomalous events is still occurring. The root cause analyzer 109, or another service of the network management system 110, can suppress events by filtering events using identifiers for the components in the service domain 1 115 and preventing the filtered events from being sent for display. Once the issue has been resolved and the components in the service domain 1 115 are functioning properly, the root cause analyzer 109 resumes normal generation of event notifications.
- FIG. 2 depicts service domains identified based on event correlation. FIG. 2 shows a service domain 1 201 and a service domain 2 202 which each comprise a subset of components executing within different layers of a network 203. The components in the service domains may have been determined to be related based on event correlation through sequence mining or through a covariance matrix. The components are related in that they function together to provide sessions for an IP telepresence service: the service domain 1 201 includes a session 1 and the service domain 2 202 includes a session 2. These sessions are supported by other components in the network 203 such as IP-multicast groups, IP-quality of service (QoS) classes, layer-3 (L3) network paths, a border gateway protocol (BGP), multiprotocol label switching paths (MPLS LSP), a virtual local area network (VLAN), and a router.
- If the “Session 1” of the service domain 1 201 fails or encounters an issue, root cause analysis of the session can be simplified by limiting the analysis to the components in the service domain 1 201. Additionally, other information about the service domain 1 201 may be utilized to identify a root cause. In one implementation, root causes are inferred based on a lowest layer component in the service domain 1 201 which is experiencing an issue. For example, if the “Group 1” IP-multicast is experiencing an issue and the router is experiencing an issue, it is determined that the router issue is the root cause of problems for the service domain 1 201, as the router is at a lower layer than the IP-multicast. A component's layer can be determined based on a component type, an assigned or logical OSI layer, etc. Additionally, a component's layer can be determined relative to other components. For example, a virtual machine is considered a higher layer than the hypervisor on which it executes. Similarly, a server which executes the hypervisor is at a higher layer than a router which it uses for transmitting network traffic.
- After determining the root cause, alarms or notifications for other components in the service domain 1 201 can be suppressed, e.g., not displayed to a user. Furthermore, if “Session 2” of the service domain 2 202 is also experiencing issues, alarms or notifications for other components in the service domain 2 202 can also be suppressed and, ultimately, only a single event or notification identifying the root cause is presented, thereby avoiding overloading an interface of network management software with notifications. Events for components of the service domains may be suppressed until the issue is resolved and then event notification may continue as normal.
- FIG. 3 depicts covariance matrices used to identify relationships between components based on event correlation. FIG. 3 depicts an event correlator 307 which produces a covariance matrix 301 at a stage A and a set of covariance matrices 302 over multiple time periods during a stage B. The columns and rows of the matrices identify components in a network, such as the components depicted in FIG. 2. The entries in the matrices represent the correlation or covariance of events between the components. In probability theory and statistics, covariance is a measure of the joint variability of two random variables. In the case of FIG. 3, the random variables are events occurring at the components. The covariance analysis of the events generates a number between 0 and 1 indicating the probability that there is a correlation between the events, 1 being a high probability and 0 being a low probability. A threshold can be set to determine whether the probability of correlation is high enough to confidently determine that two components are related and belong to a same service domain. In FIG. 3, the threshold is 85%, and the entries in the matrices which satisfy the threshold are bolded and underlined. As seen in the matrix 301, for example, there is a 90% correlation between events of the component “Session 1” and events of the component “Group 1,” so the event correlator 307 may determine that the components are part of a same service domain.
- At stage A, the event correlator 307 generates a first matrix, the matrix 301, based on event correlation. The event correlator 307 may use events from an event log from a first or most recent time period to generate the matrix 301. For example, the event correlator 307 may use events generated in the previous 10 minutes or events from a first 30 minutes of operation of the components. Since the matrix 301 is based on correlation from just a single time period, the matrix 301 is treated as a hypothesis and is tested/validated as additional events are received and analyzed.
- During stage B, the event correlator 307 continues collecting and analyzing events over multiple time periods to generate the set of covariance matrices 302. As additional matrices are generated and correlations identified, the statistical power of the correlations increases, thereby decreasing the risk of making a Type II error. A Type II error refers to the failure to reject a false hypothesis, a hypothesis being in this instance that there is a correlation between events of two components as shown in the matrix 301. Statistical power is inversely related to beta, where beta is the probability of making a Type II error (power = 1 − β). The event correlator 307 may continue collecting and analyzing events over multiple time periods until the probability of making a Type II error falls below a threshold or, stated differently, until the statistical power has exceeded a threshold. In general, the consistency with which a correlation is identified over the multiple time periods indicates the confidence which can be placed in the identified correlation. For example, if the event correlator 307 generates three matrices over three time periods and a threshold-satisfying correlation appears in all three, then the event correlator 307 can have high confidence in the correlation and the correlation likely has high statistical power. After arriving at a statistically sound result, the event correlator 307 can output a matrix based on an aggregation of the set of covariance matrices 302 or a list of related components identified based on the correlation.
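As a simplified illustration of building such a matrix, the fraction of time windows in which two components both emit events can serve as a crude pairwise correlation estimate. This co-occurrence frequency is a stand-in for the covariance analysis described above, not the exact estimator the disclosure contemplates; component names are illustrative:

```python
from itertools import combinations

def cooccurrence_matrix(windows, components):
    """Estimate pairwise event correlation as the fraction of time windows
    in which both components of a pair emit at least one event."""
    counts = {pair: 0 for pair in combinations(sorted(components), 2)}
    for window_events in windows:
        present = {e["component"] for e in window_events}
        for a, b in counts:
            if a in present and b in present:
                counts[(a, b)] += 1
    # Normalize by the number of windows to get a value between 0 and 1.
    return {pair: n / len(windows) for pair, n in counts.items()}
```

Pairs whose estimate stays above the chosen threshold (85% in FIG. 3) across multiple batches of windows would then be treated as related components.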
FIG. 4 depicts an example of using event sequence mining to perform event correlation and identify relationships between components. FIG. 4 depicts an event key 401, an event log 402, and a mined pattern 403 generated by an event correlator 407. The event key 401 identifies known entities or components in a system along with their component identifiers. Additionally, the event key 401 identifies types of events for those components along with event identifiers. The event log 402 shows a number of events stored in an event database 406 which have been sorted according to an associated timestamp. Each event indication in the event log 402 indicates a component identifier for the corresponding component and an event identifier for the type of event which occurred. For example, the first event indication in the event log 402 occurred at a component with identifier “1”, which, as shown in the event key 401, is the component “port.” Additionally, the first event indication has an event identifier of “16”, which corresponds to the event type “port down.” - The
event correlator 407 uses sequence mining on the temporal listing of events 402 in order to discover the patterns (or episodes of events) that occur sequentially within a fixed time window. This approach allows discovery of patterns which occur repeatedly with a high confidence index, indicating that the mined pattern is causal and not coincidental. The event correlator 407 may mine the data using Apriori-based algorithms to identify sequences. If an event pattern or episode is recognized within a specified timeframe with a high confidence index on causality (based on factors like number of repetitions, probabilistic distribution, etc.), then that episode is a set of events that occur one after the other and are correlated. Components associated with that set of events are then indicated as being part of a same service domain. - In
FIG. 4, the event correlator 407 mines the event log 402 using a mining algorithm to identify patterns or sequences of events. The sequence mining algorithm may be the PrefixSpan algorithm. As a result of mining the event log 402, the mined pattern 403 is identified. The mined pattern 403 is the longest subsequence which is repeated in the event log 402. The mined pattern 403 may be further processed to determine a confidence index of causality between the events in the mined pattern 403. If there is a high confidence of causality, it is determined that the events and their corresponding components are related and are part of a same service domain. Similar to the covariance matrices, the event correlator 407 may perform the sequence mining over multiple time periods to improve the statistical power of the correlations prior to determining that the events in the mined pattern 403, and their corresponding components, are related. The mined pattern 403 is just one example pattern for a service domain. A service domain can include multiple event patterns or sequences involving one or more of the same components and event types. For example, instead of a sequence beginning with the port down event, a sequence may begin with a QoS violation event which causes events at the multicast and video conference components, such as slow response times. When performing root cause analysis, events generated in a service domain can be compared to the one or more event patterns associated with the service domain to determine which pattern is occurring. The identified pattern is then used to determine a root cause of an issue.
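A minimal stand-in for the pattern-mining step (PrefixSpan itself is more general, handling gapped subsequences) is to search the ordered event log for the longest contiguous run of event identifiers that repeats without overlap. The identifiers below are hypothetical:

```python
def longest_repeated_run(log):
    """Return the longest contiguous subsequence that occurs at least
    twice (non-overlapping) in the temporally ordered event log."""
    n = len(log)
    for length in range(n // 2, 0, -1):      # try longest candidates first
        first_seen = {}
        for i in range(n - length + 1):
            key = tuple(log[i:i + length])
            if key in first_seen and i >= first_seen[key] + length:
                return list(key)             # repeated without overlap
            first_seen.setdefault(key, i)
    return []

# Event identifiers from a hypothetical log: 16 = port down,
# 3 = multicast error, 9 = conference dropped, 5/7 = unrelated noise
pattern = longest_repeated_run([16, 3, 9, 5, 16, 3, 9, 7])  # [16, 3, 9]
```

The returned run plays the role of the mined pattern 403; a real implementation would still score it for causal confidence before trusting it.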
FIG. 5 depicts a flowchart with example operations for performing event-based identification of service domains and root cause analysis. The operations of FIG. 5 are described as being performed by a network management system for consistency with FIG. 1, although naming of program code can vary among implementations. - A network management system (“system”) retrieves events from an event log for analysis (502). The system may query an event database to retrieve events or may subscribe to an event management service which forwards batches of events to the system. The system may sort the events into chronological order, filter for events of a particular type, or otherwise prepare the collection of events for analysis.
- The system begins operations for multiple time periods represented by the events (504). The system may divide or split the events into time periods for processing. For example, the system may split the events into collections of five-minute periods. Alternatively, in some implementations, the system may divide the events into sets of a number of events, e.g., 100 events per set. The time period or collection of events currently being processed is hereinafter referred to as “events for the selected time period.”
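The split into processing periods (504) might look like the following sketch, with hypothetical (timestamp, component_id, event_id) tuples and a five-minute period:

```python
from collections import defaultdict

def split_into_periods(events, period_seconds=300):
    """Bucket events into consecutive time periods by timestamp."""
    periods = defaultdict(list)
    for event in sorted(events, key=lambda e: e[0]):
        periods[event[0] // period_seconds].append(event)
    return dict(periods)

events = [(10, 1, 16), (320, 2, 7), (15, 3, 9), (610, 1, 16)]
by_period = split_into_periods(events)  # keys 0, 1, 2
```

Switching to fixed-size sets of events, as the alternative in the text suggests, would only change the bucketing key.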
- The system identifies correlations of events for the selected time period (506). The system analyzes the events and may generate a covariance matrix for components represented in the events or perform sequence mining on the events. The system may compare/combine correlations based on the events from the selected time period to correlations generated based on events from previous time periods. The system can then generate a cumulative set of event correlations based on the analysis performed across the different time periods.
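Combining correlations across periods (506) can be as simple as intersecting the strongly correlated component pairs found in each period; the 0.8 threshold and the pair representation below are illustrative only:

```python
def validated_pairs(period_matrices, r_threshold=0.8):
    """Each period's result is a dict mapping a (component, component)
    pair to its correlation; keep only pairs that clear the threshold
    in every period analyzed so far."""
    surviving = None
    for matrix in period_matrices:
        strong = {pair for pair, r in matrix.items() if abs(r) >= r_threshold}
        surviving = strong if surviving is None else surviving & strong
    return surviving or set()

cumulative = validated_pairs([
    {(1, 2): 0.95, (1, 3): 0.81},   # period 1
    {(1, 2): 0.90, (1, 3): 0.40},   # period 2: (1, 3) fails to repeat
    {(1, 2): 0.97},                 # period 3
])
```

Only the (1, 2) hypothesis survives all three periods, matching the text's intuition that consistent repetition is what earns confidence.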
- The system determines whether any event correlations satisfy a statistical threshold (508). The system may compare values representing a probability of the statistical correlations to one or more thresholds to determine whether any of the correlations have a satisfactory statistical power or confidence. Additionally, as described above, the system may determine whether the probability of making a Type II error has been sufficiently reduced for one or more of the event correlations. For event sequences, the system can determine whether the event sequence has occurred a threshold number of times or a sufficient number of times to satisfy a statistical probability that the sequence is not a random occurrence and represents correlated events.
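One hedged way to implement the statistical check (508) for a mined event sequence is a binomial tail test: if the sequence appeared far more often across the observation windows than an assumed chance rate would predict, the correlation hypothesis is retained. The chance probability and alpha below are placeholders, not values from the disclosure:

```python
from math import comb

def sequence_is_significant(occurrences, windows, p_chance, alpha=0.01):
    """P(X >= occurrences) for X ~ Binomial(windows, p_chance); the
    sequence is accepted when that tail probability is below alpha."""
    p_tail = sum(
        comb(windows, k) * p_chance ** k * (1 - p_chance) ** (windows - k)
        for k in range(occurrences, windows + 1)
    )
    return p_tail < alpha
```

For example, a sequence seen in 8 of 10 windows against a 5% chance rate passes easily, while one seen once against a 50% chance rate does not.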
- If no correlations satisfy the threshold, the system waits for an additional time period of events (510). If analyzing events from a log, the system may select a collection of events from a next time period. Alternatively, the system waits until a subsequent time period has elapsed and retrieves events for that time period or waits until another batch of events is received from an event management system. The system then continues operations at
block 504. - If there are correlations which satisfy the threshold, the system generates service domains based on threshold satisfying event correlations (512). The system identifies components corresponding to the event correlations and generates a service domain comprising the components. The service domain may be a topology, graph data structure, or a listing which identifies the components as belonging to a same service domain. The system may include information in the service domain data structure such as identified event sequences, service or network layers associated with each of the components, statistical strength of event correlations, etc. After generating at least a first service domain based on the event correlations, the system is prepared to begin root cause analysis utilizing the generated service domain represented by the operations at
block - The system detects an occurrence of an anomalous event (514).
Block 514 is depicted with a dashed outline to represent that the system continually monitors for the occurrence of anomalous events as a background operation and that the operations of blocks - The system selects at least a first service domain related to the anomalous event (516). The system determines a component corresponding to the event based on a component identifier or other indicator in the event associated with the component. The system then searches the generated service domains with the component identifier to retrieve one or more service domains which include the component. The system selects at least a first service domain for which to perform root cause analysis but can also perform root cause analysis for all affected service domains in parallel, as the service domains are likely all experiencing a same root cause since the service domains share the anomalous component.
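Generating service domains from threshold-satisfying pairs (512) and then finding the domains affected by an anomalous component (516) can be sketched as computing connected components of the correlation graph; the component names are hypothetical:

```python
def build_service_domains(correlated_pairs):
    """Union correlated component pairs into service domains, i.e. the
    connected components of the pairwise-correlation graph."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in correlated_pairs:
        parent[find(a)] = find(b)
    domains = {}
    for component in list(parent):
        domains.setdefault(find(component), set()).add(component)
    return list(domains.values())

def domains_for(component, domains):
    """Service domains containing the component that raised the event."""
    return [d for d in domains if component in d]

domains = build_service_domains(
    [("port", "multicast"), ("multicast", "conference"), ("db", "cache")])
affected = domains_for("port", domains)
```

A production data structure would also carry the extra attributes the text lists (layers, sequences, correlation strength), but the grouping logic is the same.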
- The system retrieves events related to components in the service domain (518). The system identifies all components in the service domain and then queries an event database to retrieve recent events for the components. The system may structure the query to retrieve only anomalous events for the components.
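Once the events for the domain's components are retrieved, they can be matched against the domain's known event sequences to locate a root cause; the (component_id, event_id) pairs below are hypothetical:

```python
def find_root_cause(recent_events, known_sequences):
    """If every step of a known event sequence appears among the
    recent events, report the event at the head of the sequence as
    the root cause; remaining events can be suppressed as symptoms."""
    recent = set(recent_events)
    for sequence in known_sequences:
        if all(step in recent for step in sequence):
            return sequence[0]
    return None

# Known pattern: port down (1, 16) -> multicast error (2, 3)
#                -> conference dropped (3, 9)
sequences = [[(1, 16), (2, 3), (3, 9)]]
cause = find_root_cause([(3, 9), (1, 16), (2, 3)], sequences)  # (1, 16)
```

Set membership is a simplification; an implementation following the disclosure would also check that the recent events occurred in the sequence's order.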
- The system identifies a root cause of the anomalous event based on the events within the service domain (520). In one implementation, the system can perform root cause analysis by identifying a lowest-layer component in the service domain which is experiencing an anomalous event and identifying that component as the root cause. In another implementation, the system may analyze an event sequence indicated in the service domain to identify the earliest event in the sequence which matches an event of one of the components. For example, if the sequence begins with an event of
type 3 at a component A, the system determines whether an event of type 3 recently occurred at the component A. The system may continue through the sequence to determine whether there is a matching recent event for each event in the sequence. If there is a matching sequence of events, the system determines that the event at the component corresponding to the first event in the sequence is the root cause. The system outputs the root cause event identified as a result of the analysis and suppresses other alarms or events for the service domain. The system may also perform automated remedial actions to correct the issue. For example, if the root cause event was a router issue, the system may remotely reboot the router or invoke a script for resetting a port on the router. After identifying the root cause, the system returns to block 514 until another anomalous event is detected. - Variations
-
FIGS. 1 and 3 are annotated with a series of letters. These letters represent stages of operations. Although these stages are ordered for this example, the stages illustrate one example to aid in understanding this disclosure and should not be used to limit the claims. Subject matter falling within the scope of the claims can vary with respect to the order and some of the operations. - The flowcharts are provided to aid in understanding the illustrations and are not to be used to limit scope of the claims. The flowcharts depict example operations that can vary within the scope of the claims. Additional operations may be performed; fewer operations may be performed; the operations may be performed in parallel; and the operations may be performed in a different order. For example, the operations depicted in
blocks of FIG. 5 can be performed in parallel or concurrently. Also, the iteration over multiple time periods may not be necessary if statistically sound correlations can be identified in a single iteration. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by program code. The program code may be provided to a processor of a general purpose computer, special purpose computer, or other programmable machine or apparatus.
- Any combination of one or more machine readable medium(s) may be utilized. The machine readable medium may be a machine readable signal medium or a machine readable storage medium. A machine readable storage medium may be, for example, but not limited to, a system, apparatus, or device, that employs any one of or combination of electronic, magnetic, optical, electromagnetic, infrared, or semiconductor technology to store program code. More specific examples (a non-exhaustive list) of the machine readable storage medium would include the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a machine readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. A machine readable storage medium is not a machine readable signal medium.
- A machine readable signal medium may include a propagated data signal with machine readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A machine readable signal medium may be any machine readable medium that is not a machine readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
- Program code embodied on a machine readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
- Computer program code for carrying out operations for aspects of the disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as the Java® programming language, C++ or the like; a dynamic programming language such as Python; a scripting language such as Perl programming language or PowerShell script language; and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on a stand-alone machine, may execute in a distributed manner across multiple machines, and may execute on one machine while providing results and or accepting input on another machine.
- The program code/instructions may also be stored in a machine readable medium that can direct a machine to function in a particular manner, such that the instructions stored in the machine readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
-
FIG. 6 depicts an example computer system with a service domain identifier and root cause analyzer. The computer system includes a processor unit 601 (possibly including multiple processors, multiple cores, multiple nodes, and/or implementing multi-threading, etc.). The computer system includes memory 607. The memory 607 may be system memory (e.g., one or more of cache, SRAM, DRAM, zero capacitor RAM, Twin Transistor RAM, eDRAM, EDO RAM, DDR RAM, EEPROM, NRAM, RRAM, SONOS, PRAM, etc.) or any one or more of the above already described possible realizations of machine-readable media. The computer system also includes a bus 603 (e.g., PCI, ISA, PCI-Express, HyperTransport® bus, InfiniBand® bus, NuBus, etc.) and a network interface 605 (e.g., a Fiber Channel interface, an Ethernet interface, an internet small computer system interface, SONET interface, wireless interface, etc.). The system also includes a service domain identifier and root cause analyzer 611. The service domain identifier and root cause analyzer 611 performs event-based identification of service domains and utilizes knowledge of the service domains in performing root cause analysis. Any one of the previously described functionalities may be partially (or entirely) implemented in hardware and/or on the processor unit 601. For example, the functionality may be implemented with an application specific integrated circuit, in logic implemented in the processor unit 601, in a co-processor on a peripheral device or card, etc. Further, realizations may include fewer or additional components not illustrated in FIG. 6 (e.g., video cards, audio cards, additional network interfaces, peripheral devices, etc.). The processor unit 601 and the network interface 605 are coupled to the bus 603. Although illustrated as being coupled to the bus 603, the memory 607 may be coupled to the processor unit 601.
- While the aspects of the disclosure are described with reference to various implementations and exploitations, it will be understood that these aspects are illustrative and that the scope of the claims is not limited to them. In general, techniques for event-based identification of service domains and root cause analysis as described herein may be implemented with facilities consistent with any hardware system or hardware systems. Many variations, modifications, additions, and improvements are possible.
- Plural instances may be provided for components, operations or structures described herein as a single instance. Finally, boundaries between various components, operations and data stores are somewhat arbitrary, and particular operations are illustrated in the context of specific illustrative configurations. Other allocations of functionality are envisioned and may fall within the scope of the disclosure. In general, structures and functionality presented as separate components in the example configurations may be implemented as a combined structure or component. Similarly, structures and functionality presented as a single component may be implemented as separate components. These and other variations, modifications, additions, and improvements may fall within the scope of the disclosure.
- Use of the phrase “at least one of” preceding a list with the conjunction “and” should not be treated as an exclusive list and should not be construed as a list of categories with one item from each category, unless specifically stated otherwise. A clause that recites “at least one of A, B, and C” can be infringed with only one of the listed items, multiple of the listed items, and one or more of the items in the list and another item not listed.
Claims (20)
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/145,553 US10616044B1 (en) | 2018-09-28 | 2018-09-28 | Event based service discovery and root cause analysis |
DE102019006539.5A DE102019006539A1 (en) | 2018-09-28 | 2019-09-16 | Event-based service detection and failure cause analysis |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/145,553 US10616044B1 (en) | 2018-09-28 | 2018-09-28 | Event based service discovery and root cause analysis |
Publications (2)
Publication Number | Publication Date |
---|---|
US20200106660A1 true US20200106660A1 (en) | 2020-04-02 |
US10616044B1 US10616044B1 (en) | 2020-04-07 |
Family
ID=69781137
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/145,553 Active 2038-09-29 US10616044B1 (en) | 2018-09-28 | 2018-09-28 | Event based service discovery and root cause analysis |
Country Status (2)
Country | Link |
---|---|
US (1) | US10616044B1 (en) |
DE (1) | DE102019006539A1 (en) |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200409933A1 (en) * | 2019-06-28 | 2020-12-31 | Dynatrace Llc | Business impact analysis |
CN112231194A (en) * | 2020-12-11 | 2021-01-15 | 北京基调网络股份有限公司 | Index abnormity root analysis method and device and computer readable storage medium |
US20210058310A1 (en) * | 2019-08-19 | 2021-02-25 | Martello Technologies Corporation | System and method for evaluating network quality of service |
CN112866010A (en) * | 2021-01-04 | 2021-05-28 | 聚好看科技股份有限公司 | Fault positioning method and device |
US11153144B2 (en) * | 2018-12-06 | 2021-10-19 | Infosys Limited | System and method of automated fault correction in a network environment |
US11196613B2 (en) * | 2019-05-20 | 2021-12-07 | Microsoft Technology Licensing, Llc | Techniques for correlating service events in computer network diagnostics |
US20220166660A1 (en) * | 2020-11-23 | 2022-05-26 | Capital One Services, Llc | Identifying network issues in a cloud computing environment |
US11362902B2 (en) | 2019-05-20 | 2022-06-14 | Microsoft Technology Licensing, Llc | Techniques for correlating service events in computer network diagnostics |
US11388039B1 (en) * | 2021-04-09 | 2022-07-12 | International Business Machines Corporation | Identifying problem graphs in an information technology infrastructure network |
US11403157B1 (en) * | 2020-01-31 | 2022-08-02 | Splunk Inc. | Identifying a root cause of an error |
US11533216B2 (en) * | 2020-08-28 | 2022-12-20 | Ciena Corporation | Aggregating alarms into clusters to display service-affecting events on a graphical user interface |
US20230062778A1 (en) * | 2021-09-02 | 2023-03-02 | Fujifilm Business Innovation Corp. | Information processing apparatus, information processing method, information processing system, and non-transitory computer readable medium |
WO2023140876A1 (en) * | 2022-01-24 | 2023-07-27 | Rakuten Mobile, Inc. | Topology alarm correlation |
US20230359705A1 (en) * | 2022-05-06 | 2023-11-09 | Mapped Inc. | Automatic link prediction for points in commercial and industrial environments |
EP4310681A1 (en) * | 2022-07-18 | 2024-01-24 | Nxp B.V. | Event filtering and classification using composite events |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
DE102022125396A1 (en) | 2022-09-30 | 2024-04-04 | Bundesdruckerei Gmbh | Predicting recurrence of dysfunction |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7865593B2 (en) * | 2008-08-07 | 2011-01-04 | At&T Intellectual Property I, L.P. | Apparatus and method for managing a network |
US9195943B2 (en) | 2013-03-12 | 2015-11-24 | Bmc Software, Inc. | Behavioral rules discovery for intelligent computing environment administration |
US9632858B2 (en) | 2013-07-28 | 2017-04-25 | OpsClarity Inc. | Organizing network performance metrics into historical anomaly dependency data |
US10469307B2 (en) * | 2017-09-26 | 2019-11-05 | Cisco Technology, Inc. | Predicting computer network equipment failure |
US10866844B2 (en) * | 2018-05-04 | 2020-12-15 | Microsoft Technology Licensing, Llc | Event domains |
Also Published As
Publication number | Publication date |
---|---|
US10616044B1 (en) | 2020-04-07 |
DE102019006539A1 (en) | 2020-04-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10616044B1 (en) | Event based service discovery and root cause analysis | |
US9979608B2 (en) | Context graph generation | |
US10977154B2 (en) | Method and system for automatic real-time causality analysis of end user impacting system anomalies using causality rules and topological understanding of the system to effectively filter relevant monitoring data | |
US11601349B2 (en) | System and method of detecting hidden processes by analyzing packet flows | |
US9652316B2 (en) | Preventing and servicing system errors with event pattern correlation | |
Lou et al. | Mining dependency in distributed systems through unstructured logs analysis | |
US9836952B2 (en) | Alarm causality templates for network function virtualization | |
Nguyen et al. | Pal: P ropagation-aware a nomaly l ocalization for cloud hosted distributed applications | |
US20200021511A1 (en) | Performance analysis for transport networks using frequent log sequence discovery | |
US20170279660A1 (en) | Context graph augmentation | |
WO2003005200A1 (en) | Method and system for correlating and determining root causes of system and enterprise events | |
US20140189086A1 (en) | Comparing node states to detect anomalies | |
CN110716842B (en) | Cluster fault detection method and device | |
CN109150619B (en) | Fault diagnosis method and system based on network flow data | |
CN113268399B (en) | Alarm processing method and device and electronic equipment | |
US20180176095A1 (en) | Data analytics rendering for triage efficiency | |
CN113259168A (en) | Fault root cause analysis method and device | |
Xu et al. | Logdc: Problem diagnosis for declartively-deployed cloud applications with log | |
US20200099570A1 (en) | Cross-domain topological alarm suppression | |
US10884805B2 (en) | Dynamically configurable operation information collection | |
CN113918374A (en) | Root cause analysis method, device and equipment of operation and maintenance system | |
US9443196B1 (en) | Method and apparatus for problem analysis using a causal map | |
CN108154343B (en) | Emergency processing method and system for enterprise-level information system | |
US10324818B2 (en) | Data analytics correlation for heterogeneous monitoring systems | |
JP2017521802A (en) | Architecture for correlation events for supercomputer monitoring |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: CA, INC., NEW YORK Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:KAKANI, BALRAM REDDY;PULI, RAVINDRA KUMAR;GUPTA, SMRATI;REEL/FRAME:047004/0563 Effective date: 20180926 |
|
FEPP | Fee payment procedure |
Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
|
STCF | Information on status: patent grant |
Free format text: PATENTED CASE |
|
MAFP | Maintenance fee payment |
Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY Year of fee payment: 4 |