WO2022194385A1 - Method and apparatus for multidimensional root cause analysis - Google Patents

Method and apparatus for multidimensional root cause analysis

Info

Publication number
WO2022194385A1
Authority
WO
WIPO (PCT)
Prior art keywords
kpi
changes
performance indicator
key performance
data
Prior art date
Application number
PCT/EP2021/057045
Other languages
French (fr)
Inventor
Cristian-Alexandru Olariu
MingXue Wang
Peng Hu
Chao Ma
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to PCT/EP2021/057045
Publication of WO2022194385A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 Network analysis or design
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/50 Network service management, e.g. ensuring proper service fulfilment according to agreements
    • H04L41/5003 Managing SLA; Interaction between SLA and QoS
    • H04L41/5009 Determining service level performance parameters or violations of service level contracts, e.g. violations of agreed response time or mean time between failures [MTBF]

Definitions

  • the disclosure relates generally to apparatus including data processing arrangements for implementing multidimensional root cause analysis; for example, the disclosure relates to apparatus for implementing multidimensional root cause analysis for reconfiguring systems to improve their reliability and operating performance. Moreover, the disclosure relates to methods for (namely, to methods of) using the aforesaid apparatus to determine parameters that affect operating performance of systems, wherein the operating performance is defined by at least one key performance indicator (KPI); for example, the disclosure relates to methods for using the determined parameters to improve reliability and operating performance of the systems.
  • KPI key performance indicator
  • An operating performance of a given system is an important measure of how the given system is technically functioning. In many situations, considerable effort is expended to improve operating performances of systems, for example to increase their reliability, to reduce their latency, to reduce drop-out occurrences therein, to improve their signal-to-noise, and so forth. Data required to monitor performances of systems are often associated with many dimensions. Examples of multidimensional data that arise in Information and Communication Technology (ICT) operation and management include microservice interaction data, client trace data, and mobile application promote download data. Numerous dimensions of ICT operations and management scenarios are categorical, and each dimension has one or more different categories (also called dimension values or items). Such a vast variety of data dimensions and their items requires strict processing and pipeline integration.
  • ICT Information and Communication Technology
  • Contemporary known systems beneficially use a complex monitoring apparatus that uses anomaly detection to monitor one or more key performance indicators (KPI) of the systems at a high level.
  • the apparatus is configured to perform root cause analysis only once an anomaly is detected, making assumptions that there is only one single cause triggering an alarm indicative of inadequate KPI’s.
  • a root cause analysis is used to find one or more problems in a given system, given that a diagnosis is performed on a larger timeframe.
  • these one or more problems can co-occur and still not be correlated.
  • finding and solving these one or more problems are crucial in maintaining a high performance of the given system.
  • a complete picture of all the one or more problems is essential.
  • the disclosure provides a method for using a data processing arrangement for determining parameters that affect operating performance of a system defined by at least one key performance indicator (KPI) and an apparatus that is configured to implement the above method.
  • the method includes collecting operations data from the system.
  • the method includes using the data processing arrangement to categorize the operations data according to dimensions and corresponding dimension values.
  • the method includes using the data processing arrangement to determine associations of one or more changes of dimension values per category of dimension with changes of the at least one key performance indicator (KPI).
  • the method includes determining a sensitivity factor of the one or more changes with the changes of at least one key performance indicator (KPI) to determine which changes most sensitively affect the at least one key performance indicator (KPI).
  • the method detects the root cause of operating performance degradation in the system.
  • the root cause is reflected in the data processing arrangement that is used to monitor the performance of systems.
  • the method enhances a performance of the system by monitoring multidimensional data and finding a dimension and a dimension value manifesting highest influence on the key performance indicator (KPI).
  • a combination of dimensions and dimension values is considered for an accurate multidimensional root cause analysis. It will be appreciated that such root cause analysis based on a combination of dimensions and dimension values is very hard for humans to perform manually. The method enables faster diagnosis than a manual human process while finding a correct root cause among a large number of items to check against the KPI’s degradation.
  • the method can be performed for any domain that collects data as events characterized by categorical or numerical data, and has a goal of finding the item(s) correlated with a specific target feature, thereby avoiding domain-specific thresholds to be set by a user.
  • the method can be employed as an anomaly detection solution when run periodically to build a baseline (common unfixable problems, for example a user using a broken version).
  • an apparatus configured to implement a method of using a data processing arrangement for determining parameters that affect operating performance of a system defined by at least one key performance indicator (KPI).
  • the apparatus includes a data collecting arrangement for collecting operations data from the system.
  • the data processing arrangement is configured to categorize the operations data according to dimensions and corresponding dimension values.
  • the data processing arrangement is configured to determine associations of one or more changes of dimension values per category of dimension with changes of the at least one key performance indicator (KPI).
  • the data processing arrangement is configured to determine a sensitivity factor of the one or more changes with the changes of at least one key performance indicator (KPI) to determine which changes most sensitively affect the at least one key performance indicator (KPI).
  • the apparatus detects the root cause of operating performance degradation in the system.
  • the root cause is reflected in the data processing arrangement that is used to monitor the performance of systems.
  • the apparatus enhances a performance of the system by monitoring multidimensional data and finding a dimension and a dimension value manifesting the highest influence on the key performance indicator (KPI).
  • a combination of dimensions and dimension values is considered for an accurate multidimensional root cause analysis.
  • the apparatus enables faster diagnosis than human manual process while finding a correct root cause in a large number of items to check against the KPI’s degradation.
  • the apparatus can be used for any domain that collects data as events characterized by categorical or numerical data, and has a goal of finding the item(s) correlated with a specific target feature, thereby avoiding domain-specific thresholds to be set by a user.
  • the apparatus can be employed as an anomaly detection solution when run periodically to build a baseline (common unfixable problems, for example a user using a broken version).
  • a computer program product comprising a non-transitory computer-readable storage medium having computer-readable instructions stored thereon, the computer-readable instructions being executable by a computerized device comprising processing hardware to execute a method.
  • the method of the first aspect, and the apparatus of the second aspect provide a solution to the aforesaid problems of finding a relevant dimension and its item, or multiple combinations of dimensions and their items, that manifest(s) the highest influence on the behaviour of the KPI (for example temporal delay, success rate, download count, and so forth).
  • the aforesaid method and the apparatus are able to determine parameters that are a root cause of operating performance degradation in the system, wherein the root cause is reflected in the data processing arrangement used to monitor the performance of systems.
  • the method enhances the performance of the system by monitoring multidimensional data and finding a dimension and a dimension value manifesting a highest influence on a given key performance indicator (KPI).
  • a combination of dimensions and dimension values is considered for an accurate multidimensional root cause analysis.
  • the method enables faster diagnosis than human manual processing to be achieved, while finding a correct root cause in a large number of items to check against the KPI’s degradation.
  • the method is susceptible to being used for any domain that collects data such as events characterized by categorical or numerical data, and has a goal of finding item(s) correlated with a specific target feature, thereby avoiding domain-specific thresholds to be set by a user.
  • the method can be employed as an anomaly detection solution when run periodically to build a baseline (common unfixable problems, for example a user using a broken version).
  • FIG. 1 is a block diagram of an apparatus that is configured to implement a method of using a data processing arrangement for determining parameters that affect operating performance of a system defined by at least one key performance indicator (KPI) in accordance with an implementation of the disclosure.
  • FIG. 2 is a block diagram of a data processing arrangement in accordance with an implementation of the disclosure.
  • FIG. 3 is a block diagram of a multidimensional analysis module in accordance with an implementation of the disclosure.
  • FIG. 4 is an illustration of an exemplary view of a regression process of a decision tree fitting module in accordance with an implementation of the disclosure.
  • FIGS. 5A to 5B are illustrations of exemplary views of a process for determining a relevance of each leaf node using a post-processing module in accordance with an implementation of the disclosure.
  • FIGS. 6A to 6C are illustrations of exemplary views of a pruning process of each leaf node in accordance with an implementation of the disclosure.
  • FIG. 7 is an illustration of an exemplary graphical user interface that depicts a summary integration of a multidimensional root cause analysis in accordance with an implementation of the disclosure.
  • FIGS. 8A to 8B are flow diagrams that illustrate a method for using a data processing arrangement for determining parameters that affect operating performance of a system defined by at least one key performance indicator (KPI) in accordance with an implementation of the disclosure.
  • FIG. 9 is an illustration of an exemplary apparatus, or a computer system in which the various architectures and functionalities of the various previous implementations may be implemented.
  • Implementations of the disclosure provide a method for using a data processing arrangement and an apparatus that is configured to implement the method for using the data processing arrangement.
  • a process, a method, a system, a product, or a device that includes a series of steps or units is not necessarily limited to expressly listed steps or units but may include other steps or units that are not expressly listed or that are inherent to such process, method, product, or device.
  • FIG. 1 is a block diagram of an apparatus 100 that is configured to implement a method for (namely, a method of) using a data processing arrangement 104 for determining parameters that affect operating performance of a system 106 defined by at least one key performance indicator (KPI) in accordance with an implementation of the disclosure.
  • the apparatus 100 includes a data collecting arrangement 102 for collecting operations data from the system 106.
  • the data processing arrangement 104 is configured to categorize the operations data according to dimensions and corresponding dimension values.
  • the data processing arrangement 104 is configured to determine associations of one or more changes of dimension values per category of dimension with changes of the at least one key performance indicator (KPI).
  • the data processing arrangement 104 is configured to determine a sensitivity factor of the one or more changes with the changes of at least one key performance indicator (KPI) to determine which changes most sensitively affect the at least one key performance indicator (KPI).
  • the apparatus 100 aims at identifying the dimension, or dimension values, or a combination thereof causing one or more Key Performance Indicators (KPIs) to degrade.
  • the collected data from the system 106 along with the Key Performance Indicator, KPI indicate the multitude of factors affecting the operating performance of the system 106.
  • the system 106 is diagnosed to indicate and provide a multidimensional root cause analysis to find one or more dimensions responsible for operating performance degradation on a larger time frame.
  • the data processing arrangement 104 provides supporting data and metrics which may identify the root cause and the exact dimensions causing degradation. A chain of dimensions and dimension values is created in order to lead to a terminal node explaining the detection of the root cause and its analysis.
  • the apparatus 100 enables faster diagnosis than human manual processes and analysis, while finding a correct root cause in a large number of items (for example many millions) to check against the KPI’s degradation.
  • the apparatus 100 can be used for any domain that collects data as events characterized by categorical or numerical data and has a goal of finding the item(s) correlated with a specific target feature, thereby avoiding domain-specific thresholds to be set by a user.
  • the apparatus 100 can be employed as an Anomaly Detection solution when run periodically in order to build a baseline (common unfixable problems, for example a user using a broken version).
  • FIG. 2 is a block diagram of a data processing arrangement 200 in accordance with an implementation of the disclosure.
  • the data processing arrangement 200 includes an alarm system 202, a data source 204, a multidimensional analysis module 206, a business configuration module 208, a domain-expert knowledge module 210, a multidimensional understanding module 212, and a root cause summarizer 214.
  • the alarm system 202 monitors a Key Performance Indicator (KPI) and triggers alarms when an overall performance of a system is unsatisfactory (for example, through anomaly detection).
  • the alarms may provide a context needed for retrieving data associated with a timeframe and a scope of a problem.
  • a multidimensional analysis process is triggered at the multidimensional analysis module 206.
  • the data source 204 is a source of multi-categorical multi-dimensional data that is associated with a target value (for example, KPI).
  • the data source 204 may be queried by the multidimensional analysis module 206 for data contextual to the alarm system 202. For example, the data source 204 may obtain data for a certain microservice for 10 minutes leading to the triggering of the alarms, as well as data for the same period but 1 day before to be used as reference.
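  • As a minimal sketch of the contextual query described above, the following Python snippet derives the alarm window and the reference window. The function name `query_windows` and its parameters are hypothetical and not part of the disclosure; only the 10-minute lookback and the 1-day offset are taken from the example in the text.

```python
from datetime import datetime, timedelta


def query_windows(alarm_time,
                  lookback=timedelta(minutes=10),
                  reference_offset=timedelta(days=1)):
    """Return (alarm_window, reference_window) as (start, end) tuples.

    alarm_window covers the period leading up to the alarm;
    reference_window is the same period shifted back by reference_offset.
    """
    alarm_window = (alarm_time - lookback, alarm_time)
    reference_window = (alarm_window[0] - reference_offset,
                        alarm_window[1] - reference_offset)
    return alarm_window, reference_window


# Hypothetical alarm raised at noon on an arbitrary day.
alarm_window, reference_window = query_windows(datetime(2021, 3, 18, 12, 0))
```

  • Both windows would then be passed to whatever query interface the data source exposes, which the text does not specify.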
  • the business configuration module 208 may be use-case specific and is required only once when the multidimensional analysis module 206 is deployed.
  • the business configuration module 208 includes names of dimensions to be analysed, for direct analysis by the business configuration module 208, as well as names of the dimensions used in post-processing and domain expert knowledge-based causality inference.
  • the business configuration module 208 may include other analysis-specific details, such as depth of a tree or a minimum number of samples per leaf. The minimum number of samples per leaf may be a number of observations each leaf must have or else a split cannot occur; this is used both for shortening the algorithm’s recursive cycles and for fulfilling use-case specific requirements if these are necessary.
  • the business configuration module 208 may include an operation mode of an analysis, which can be single problem mode when used in conjunction with the alarms, or multi-problem mode when the multidimensional analysis module 206 is triggered manually and a diagnosis of a monitored system is desired.
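  • The configuration items listed above could be grouped as in the following sketch. The class `BusinessConfig` and all field names are hypothetical illustrations; the disclosure names the kinds of settings but not a concrete data structure.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BusinessConfig:
    # Names of dimensions fed directly to the analysis.
    analysed_dimensions: List[str]
    # Names of dimensions used only in post-processing and causality inference.
    supporting_dimensions: List[str] = field(default_factory=list)
    # Depth of the tree; None means "compute it automatically".
    max_depth: Optional[int] = None
    # Minimum number of observations each leaf must have for a split to occur.
    min_samples_per_leaf: int = 1
    # "single" when used with alarms, "multi" for a manually triggered diagnosis.
    mode: str = "single"

    def __post_init__(self):
        if self.mode not in ("single", "multi"):
            raise ValueError("mode must be 'single' or 'multi'")


config = BusinessConfig(
    analysed_dimensions=["app_version", "phone_model", "region"],
    supporting_dimensions=["timestamp", "transaction_id"],
    min_samples_per_leaf=50,
)
```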
  • the multidimensional analysis module 206 obtains (i) Key performance Indicator, KPI, data, and associated multidimensional data, (ii) a business configuration, and (iii) domain expert knowledge as an input.
  • the domain expert-knowledge module 210 is used to assess a solution in the scope of the use-case (for example, dimension dependency, dimension hierarchy, architecture composition, causal inference rules). This may be fed as an input to the multidimensional understanding module 212.
  • the multidimensional understanding module 212 recommends a root cause based on a user’s configuration and the domain expert knowledge obtained from the domain expert-knowledge module 210 and based on an output of the multidimensional analysis module 206.
  • the multidimensional understanding module 212 may allow a user to specify dimensions to have their statistics reported on for those leaf nodes selected by the tree as root-cause. Examples of such dimensions include dimensions that are too granular to be useful to the multidimensional analysis module 206, for example, timestamps or unique IDs of a transaction. However, reporting these dimensions once the multidimensional understanding module 212 converged to a recommendation is useful as the dimensions may be fused with domain expertise.
  • the domain expertise may comprise known relationships between the dimensions, which can be transformed into rules that may report an actionable root cause.
  • a specific software application version has to affect a large proportion of users using the same version in order to be the root cause; if the root cause affects a very small percentage of users (for example, < 0.1%), then it is a “single user” problem.
  • the user’s county and phone model cannot be part of a combination of dimensions, as the scope of the phone model does not depend on the location of a user.
  • a further use of domain knowledge is to organize the dimensions into categories, such as network-, client-, content-, or server-related dimensions. This may link the root cause to a location of a problem, which in turn may trigger different corrective actions. All these dimension dependencies and logical or location-based groupings stemming from domain knowledge are used for post-processing by the multidimensional understanding module 212 for root cause suggestion.
  • the root cause summarizer 214 suggests a root cause based on data, configuration, and the domain expert knowledge obtained from the domain expert-knowledge module 210.
  • the root cause summarizer 214 may suggest the root cause as consumable output by the system 200 using the items and their importance to present the order in which the problem has propagated through the data.
  • the root cause summarizer 214 may use the data in the problematic leaves to explain how and by how much the distribution of the KPI’s value is different or has changed from the expected behaviour.
  • the root cause summarizer 214 may explain which domain expert-rules have been met and how the decision was adjusted according to those insights.
  • the root cause summarizer 214 provides a summary that lists all the possible problems based on their relative severity.
  • FIG. 3 is a block diagram of a multidimensional analysis module 300 in accordance with an implementation of the disclosure.
  • the multidimensional analysis module 300 includes an input obtaining module 302, a pre-processing module 304, an auto-configuration module 306, a decision tree fitting module 308, a post-processing module 310, and a decision tree-based root cause module 312.
  • the input obtaining module 302 obtains (i) Key performance Indicator, KPI, data, and associated multidimensional data, (ii) a business configuration, and (iii) domain expert knowledge as input data.
  • the input obtaining module 302 obtains raw multidimensional data, reference data, and configurations such as use-case specific requirements, detectable feature names, supporting feature names, and operation modes (for example, single versus multiple problems).
  • the pre-processing module 304 is used to ensure that the data (for example, raw multidimensional data, reference data, and configurations data) is consistent in terms of input to the algorithm, such as filling missing values for certain dimensions with certain values or directly dropping
  • the auto-configuration module 306 ensures that data is consistent in terms of input to an algorithm. For example, if microservices are monitored for success rate of the fulfilled requests, then a Key Performance Indicator, KPI, may be a Boolean variable where True means a successful request and False means a failure in fulfilling a request. In this case, a possible transformation is to map the True values to 0 and the False values to 1. Similarly, if the Key Performance Indicator, KPI, is representing a delay/latency in fulfilling those requests, then it is possible to use a KPI value directly, as low values such as 0 represent a good behaviour/performance whereas high values mean high latency in fulfilling the requests which are typically associated with a reason of performance degradation.
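  • The KPI transformation described above can be sketched as follows. The function name `to_badness` is hypothetical; the mapping itself (True to 0, False to 1, latency used directly) follows the text.

```python
def to_badness(kpi_value):
    """Map a raw KPI value to a number where higher means worse.

    Boolean success flags: True (success) -> 0, False (failure) -> 1.
    Numeric KPIs such as latency are used directly.
    """
    # bool must be checked first because bool is a subclass of int in Python.
    if isinstance(kpi_value, bool):
        return 0 if kpi_value else 1
    return float(kpi_value)


success_flags = [True, True, False]
encoded = [to_badness(v) for v in success_flags]
latency_ms = to_badness(250)
```

  • With this encoding, the average of the transformed KPI over a group of requests is the failure rate of that group, so a higher average always indicates worse behaviour.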
  • the auto-configuration module 306 transforms the data to suit and expedite the decision tree fitting module 308.
  • the auto-configuration module 306 may re-compute each item of each dimension as a contribution difference to its typical or expected influence on the KPI, thereby more accurately pointing to an item or combination of items that reflect the root cause of a problem.
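  • One possible reading of the "contribution difference" above is the per-item KPI average in the alarm window minus the per-item KPI average in the reference window. The sketch below illustrates that reading only; the disclosure does not give a formula, and the function names, the row layout, and the baseline of 0.0 for items unseen in the reference data are all assumptions.

```python
from collections import defaultdict


def item_kpi_means(rows, dimension, kpi_key="kpi"):
    """Average KPI per item (dimension value) of one dimension."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        sums[row[dimension]] += row[kpi_key]
        counts[row[dimension]] += 1
    return {item: sums[item] / counts[item] for item in sums}


def contribution_difference(current_rows, reference_rows, dimension, kpi_key="kpi"):
    """Current per-item KPI average minus the reference per-item KPI average."""
    current = item_kpi_means(current_rows, dimension, kpi_key)
    reference = item_kpi_means(reference_rows, dimension, kpi_key)
    # Assumption: an item unseen in the reference data has a baseline of 0.0.
    return {item: current[item] - reference.get(item, 0.0) for item in current}


# KPI encoded as 1 = failed request, 0 = successful request.
reference_rows = [{"app_version": "v1", "kpi": 0}, {"app_version": "v1", "kpi": 0},
                  {"app_version": "v2", "kpi": 0}, {"app_version": "v2", "kpi": 0}]
current_rows = [{"app_version": "v1", "kpi": 0}, {"app_version": "v1", "kpi": 0},
                {"app_version": "v2", "kpi": 1}, {"app_version": "v2", "kpi": 1}]
diff = contribution_difference(current_rows, reference_rows, "app_version")
```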
  • in the auto-configuration module 306, the depth at which a regression tree may stop can be dynamically computed if it is unspecified in a configuration. For example, this can be computed by taking twice the number of features as the maximum depth, to ensure that each dimension has space to have at least two of its items singled out as problematic.
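  • The default depth rule can be written in a few lines. The function name `resolve_max_depth` is hypothetical, and treating the "number of features" as the number of analysed dimensions is an assumption.

```python
def resolve_max_depth(configured_depth, dimension_names):
    """Use the configured depth if given, else twice the number of dimensions."""
    if configured_depth is not None:
        return configured_depth
    return 2 * len(dimension_names)


dimensions = ["app_version", "phone_model", "region"]
auto_depth = resolve_max_depth(None, dimensions)   # unspecified in the configuration
fixed_depth = resolve_max_depth(4, dimensions)     # explicitly configured
```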
  • the decision tree fitting module 308 obtains the configuration and formatted data and fits a regression decision tree.
  • the regression decision tree may iteratively split the data across different items, each time optimizing for an error or variance of the KPI of the ensuing data.
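  • To illustrate a single splitting step, the sketch below searches for the item whose one-versus-rest split gives the largest reduction in KPI variance. It is a simplified stand-in, not the disclosed implementation: the one-versus-rest split on categorical items, the use of population variance, and all names are assumptions, and a full regression tree would apply the same step recursively to each side of the split.

```python
from statistics import pvariance


def variance_reduction(parent, left, right):
    """Parent variance minus the sample-weighted variance of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * pvariance(left) + (len(right) / n) * pvariance(right)
    return pvariance(parent) - weighted


def best_item_split(rows, dimensions, kpi_key="kpi"):
    """Return (dimension, item, gain) of the best one-vs-rest split, or None."""
    parent = [row[kpi_key] for row in rows]
    best = None
    for dim in dimensions:
        for item in sorted({row[dim] for row in rows}):
            left = [row[kpi_key] for row in rows if row[dim] == item]
            right = [row[kpi_key] for row in rows if row[dim] != item]
            if not left or not right:
                continue  # a split must leave data on both sides
            gain = variance_reduction(parent, left, right)
            if best is None or gain > best[2]:
                best = (dim, item, gain)
    return best


# KPI encoded as 1 = failed request, 0 = successful request.
rows = [
    {"app_version": "v1", "region": "eu", "kpi": 0},
    {"app_version": "v1", "region": "us", "kpi": 0},
    {"app_version": "v2", "region": "eu", "kpi": 1},
    {"app_version": "v2", "region": "us", "kpi": 1},
    {"app_version": "v3", "region": "eu", "kpi": 0},
    {"app_version": "v3", "region": "us", "kpi": 0},
]
best = best_item_split(rows, ["app_version", "region"])
```

  • In this toy data every failure occurs on `app_version` `v2`, so separating `v2` from the rest removes all KPI variance, whereas splitting on `region` removes essentially none.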
  • the post-processing module 310 computes relevance of each leaf node by subtracting a minimum of the KPI’s average across all leaf nodes from a maximum of the KPI’s average, and this yields a range of where the KPI’s average lies. This range is divided by two and this results in a relevance threshold. Leaf nodes above the relevance threshold are considered relevant candidates of being a root cause.
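  • The relevance rule above can be sketched literally as follows. The function name and the leaf labels are hypothetical. The threshold is computed exactly as the text states (range divided by two); with a KPI encoded so that the healthiest leaf averages close to 0, this is close to the midpoint between the best and the worst leaf.

```python
def relevant_leaves(leaf_kpi_averages):
    """Return (threshold, names of leaves whose KPI average exceeds it)."""
    values = list(leaf_kpi_averages.values())
    kpi_range = max(values) - min(values)   # range of the KPI's average
    threshold = kpi_range / 2               # relevance threshold
    relevant = [name for name, avg in leaf_kpi_averages.items() if avg > threshold]
    return threshold, relevant


# Hypothetical KPI averages (failure rates) per leaf node.
leaf_kpi_averages = {"leaf_a": 0.02, "leaf_b": 0.10, "leaf_c": 0.60, "leaf_d": 0.90}
threshold, relevant = relevant_leaves(leaf_kpi_averages)
```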
  • the decision tree-based root cause module 312 provides one or more problems associated with degradation of the KPI based on the configuration such as top 1 or top N problems. If top 1 problem is selected, then a leaf node is selected and presented together with a path from a root node to itself as the root cause. This can be achieved by selecting the leaf node with a highest KPI average value, or the same but weighted according to a number of samples in that node relative to the whole dataset.
  • If top N problems mode is selected, then all relevant leaves are selected, their paths are aggregated and a summary is proposed. This may be an indication of various problems across a system that is monitored and can be used as an overall system health check across a longer period of time. For the top N problems mode, redundancy may be directly influenced by the relevance threshold, and this can be used as a sensitivity setting.
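  • The two selection modes can be sketched as follows. The `Leaf` structure, the function names, and the sample figures are hypothetical; the scoring follows the text (highest KPI average, optionally weighted by the leaf's share of all samples, and all leaves above the relevance threshold for the top N mode).

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Leaf:
    path: Tuple[Tuple[str, str], ...]  # (dimension, item) decisions from the root
    kpi_average: float
    samples: int


def top_one(leaves, weighted=False):
    """Leaf with the highest KPI average, optionally weighted by sample share."""
    total = sum(leaf.samples for leaf in leaves)

    def score(leaf):
        if weighted:
            return leaf.kpi_average * (leaf.samples / total)
        return leaf.kpi_average

    return max(leaves, key=score)


def top_n(leaves, relevance_threshold):
    """All leaves above the relevance threshold, worst first."""
    relevant = [leaf for leaf in leaves if leaf.kpi_average > relevance_threshold]
    return sorted(relevant, key=lambda leaf: leaf.kpi_average, reverse=True)


leaves = [
    Leaf((("app_version", "v2"),), kpi_average=0.9, samples=10),
    Leaf((("region", "eu"),), kpi_average=0.6, samples=500),
    Leaf((("region", "us"),), kpi_average=0.001, samples=9490),
]
worst = top_one(leaves)
worst_weighted = top_one(leaves, weighted=True)
problems = top_n(leaves, relevance_threshold=0.45)
```

  • In this example the unweighted mode points at the small but severely degraded `v2` leaf, while the weighted mode prefers the `eu` leaf because it affects far more samples; lowering the threshold passed to `top_n` would report more leaves, which is how the threshold acts as a sensitivity setting.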
  • FIG. 4 illustrates an exemplary view of a regression process of a decision tree fitting module 400 procedure in accordance with an implementation of the disclosure.
  • the decision tree fitting module 400 may perform an iterative process 402 to split data across different items into leaf nodes 406A and 406B. Each item is optimised for an error or variance of the KPI of the ensuing data.
  • the decision tree fitting module 400 determines a set of parameters ("branches") that are broken down into sub-sets (“leaves” or leaf nodes) to determine which parameters in sub-sets (namely, leaf nodes 406A and 406B) are most sensitively affecting the KPI's.
  • the decision tree fitting module 400 includes two types of nodes: a decision node 404 showing the item leading to a split and terminal or leaf nodes where the iterative process 402 stops.
  • the variance reduction between the decision node 404 and the leaf nodes 406A-B may be used to determine which changes most sensitively affect the at least one key performance indicator (KPI).
  • the item has minimum variance between the leaf node (406A and 406B) across the different items in a dataset.
  • the path of decision from the decision node 404 namely, a root node
  • each of the leaf nodes (406A-F) is a list of items on which the decision was made for the split.
  • the leaf nodes (406A-F) provide insight to whether the key performance indicator (KPI) has been impacted negatively or not by a specific leaf node.
  • FIGS. 5A and 5B are illustrations of exemplary views of determining a relevance of each leaf node using a post-processing module in accordance with an implementation of the disclosure.
  • the exemplary views include a decision node 502A showing the item leading to a split and leaf nodes 502B-S for each of which the relevancy is determined.
  • The relevance of each leaf node 502B-S is determined by subtracting a minimum average of the key performance indicator (KPI) across all leaf nodes (namely, circled leaf nodes 502F, 502G, 502I, 502J, 502K, 502M, 502O, 502Q, 502S) from the maximum average of the key performance indicator (KPI) across all the leaf nodes, which yields a range of the key performance indicator’s (KPI) average.
  • the range of the key performance indicator’s (KPI) average is divided by two to obtain a relevance threshold.
  • the leaf nodes 502B-S that are above the relevance threshold are considered relevant candidates (for example, 502F, 502G, 502J) of being a root cause or affecting the key performance indicator (KPI) of a system.
  • FIGS. 6A and 6B illustrate exemplary views of a pruning process of each leaf node in accordance with an implementation of the disclosure.
  • the exemplary views show the decision node 502A showing the item leading to a split and the leaf nodes 502B-S on which the pruning process has occurred.
  • the pruning process as shown in FIGS. 6A and 6B depicts that two or more leaf nodes 502B-S with the same relevance provide no extra information, hence these leaf nodes 502B-S may be pruned in order to shorten a tree, shortening the paths leading to the leaf nodes 502B-S. Pruning of the leaf nodes 502B-S may remove items with redundant splits.
  • the leaf nodes 502B-S with the same relevance level are removed and their data is aggregated back to their corresponding parent node, which may become a leaf node, and the split that occurred at that leaf node is hence removed. This pruning process is repeated until all redundancy has been removed.
  • the remaining tree may include a smaller number of leaf nodes and shorter decision paths.
  • the tree is pruned in order to ensure the decision paths to the leaf nodes 502B-S do not include decisions with small gain in information. As such, if two leaf nodes of a same parent node have the same relevance level, then the split of those two leaf nodes that occurred at the parent node is considered redundant, and the split is reverted by pruning those two leaf nodes.
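  • The pruning rule can be sketched as a bottom-up pass over a binary tree. The `Node` structure is hypothetical, and interpreting "same relevance level" as "both sibling leaves fall on the same side of the relevance threshold" is an assumption. Each node already stores the aggregated KPI sum and sample count of its subtree, so collapsing a split leaves the parent's data intact.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    kpi_sum: float      # sum of the KPI over all samples under this node
    samples: int        # number of samples under this node
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    @property
    def is_leaf(self):
        return self.left is None and self.right is None

    @property
    def kpi_average(self):
        return self.kpi_sum / self.samples


def prune(node, threshold):
    """Revert every split whose two leaf children have the same relevance level."""
    if node.is_leaf:
        return node
    # Children first, so a parent that becomes a leaf is re-checked one level up.
    prune(node.left, threshold)
    prune(node.right, threshold)
    if node.left.is_leaf and node.right.is_leaf:
        left_relevant = node.left.kpi_average > threshold
        right_relevant = node.right.kpi_average > threshold
        if left_relevant == right_relevant:
            node.left = None
            node.right = None  # the parent becomes a leaf
    return node


# Root splits into a degraded leaf and a healthy subtree with a redundant split.
healthy = Node(kpi_sum=1.0, samples=90,
               left=Node(kpi_sum=0.5, samples=40),
               right=Node(kpi_sum=0.5, samples=50))
root = Node(kpi_sum=10.0, samples=100,
            left=Node(kpi_sum=9.0, samples=10),
            right=healthy)
prune(root, threshold=0.44)
```

  • After pruning, the healthy subtree is collapsed into a single leaf, while the root split is kept because only one of its children exceeds the threshold.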
  • both leaf nodes are highly unlikely to be the root cause.
  • the split offers a small gain of information. In other words, the only splits that remain are those where one of the ensuing leaf nodes has met the cut-off threshold and the other one has not.
  • the item that is chosen splits the data in two leaf nodes, one leaf node is highly associated with the KPI’s degradation and the other leaf node, respectively.
  • FIG. 7 is an illustration of an exemplary graphical user interface that depicts a summary integration of a multidimensional root cause analysis in accordance with an implementation of the disclosure.
  • the graphical user interface depicts an alert list of incidents and their multidimensional root cause analysis in a table 702.
  • the graphical user interface includes a trace graph analytics field 704 and a trace log multidimensional field 706.
  • the trace graph analytics field 704 shows a problem for a selected alarm and the trace log multidimensional field 706 shows the details of a single alert.
  • FIGS. 8A to 8C are flow diagrams that illustrate a method for using a data processing arrangement for determining parameters that affect operating performance of a system defined by at least one key performance indicator (KPI) in accordance with an implementation of the disclosure.
  • operations data are collected from the system.
  • the operations data are categorized according to dimensions and corresponding dimension values using the data processing arrangement.
  • associations of one or more changes of dimension values per category of dimension with changes of the at least one key performance indicator (KPI) are determined using the data processing arrangement.
  • a sensitivity factor of the one or more changes with the changes of at least one key performance indicator (KPI) is determined to determine which changes most sensitively affect the at least one key performance indicator (KPI).
  • the method detects the root cause of operating performance degradation in the system.
  • the root cause is reflected in the data processing arrangement that is used to monitor the performance of systems.
  • the method enhances a performance of the system by monitoring multidimensional data and finding a dimension and a dimension value manifesting highest influence on the key performance indicator (KPI).
  • a combination of dimensions and dimension values are considered for an accurate multidimensional root cause analysis.
  • the method enables faster diagnosis than is feasible using human manual processing while finding a correct root cause in a large number of items (for example, many millions) to check against the KPI’s degradation.
  • the method can be performed for any domain that collects data as events characterized by categorical or numerical data and has a goal of finding the item(s) correlated with a specific target feature, thereby avoiding the need for a user to set domain-specific thresholds.
  • the method can be employed as an Anomaly Detection solution when run periodically to build a baseline (common unfixable problems, for example a user using a broken version).
  • the method determines particular parameters (for example, a sensitivity factor) that strongly influence particular KPIs. For example, user numbers may correlate strongly with latency as a KPI; alternatively, certain models of smart phones may be more prone to drop-outs, where the drop-out rate is a KPI. Some smart phones may have a version of software that is strongly correlated with a poor sound quality, where the sound quality is a KPI. Some models of smart phones with a certain version of software (namely, a combination of two factors) may be prone to latching up (namely, software crash), where the operating reliability is a KPI.
  • the sensitivity factor of the one or more changes is determined over one or more time periods of mutually different durations to reduce an effect of stochastic noise in the operations data when determining the sensitivity factor.
  • One of the time periods may be a reference time period.
  • the method includes performing steps (iii) and (iv) iteratively until the dimensions and corresponding dimensional values that result in a maximum value of the sensitivity factor of the one or more changes with the changes of at least one key performance indicator (KPI) are determined.
  • the method includes determining a change in the sensitivity factor resulting from an interactive combination of changes in the dimension values of the dimensions. The method may include determining one or more binary state changes in the dimension values.
  • the method includes determining from the changes that most sensitively affect the at least one key performance indicator (KPI) one or more technical problems affecting the system.
  • the method includes using in (ii) a regression decision tree for implementing iteratively a splitting process to split the dimension values according to pre-defined criteria into one or more data sets to determine an influence of the dimension values on the at least one key performance indicator (KPI).
  • the method includes repetitively implementing the splitting process such that a split is found that creates a smallest sum of mean squared errors in the one or more data sets of the dimension values, until a stopping criterion is met.
  • each decisional split is taken across the data where one certain item is present. The item is selected according to a difference between a negative influence that item has on the KPI as compared to remaining data where the item is not present.
  • the chain of items leading to a terminal node explains how the root cause was detected, as well as the importance of each item in that chain. Results are expressed along with supporting data and metrics computed to conclude.
  • the method does not require a threshold setting.
  • the metric used in determining which item to split by is relative, as it is computed based on the KPI’s value associated with a particular item and is then compared item by item; the item with the highest contribution wins.
  • the method further includes determining, from the sensitivity factor and its association with the at least one key performance indicator (KPI), one or more problems in operation of the system causing a degradation in performance of the system.
  • the decision tree may split the dimension to lead a chain of items, further leading to a terminal node, detecting and explaining the root cause for operating performance degradation.
  • a computer program product including a non-transitory computer-readable storage medium having computer-readable instructions stored thereon, the computer-readable instructions being executable by a computerized device including processing hardware to execute a method.
  • FIG. 9 is an illustration of an exemplary apparatus, or a computer system 900, in which the various architectures and functionalities of the various previous implementations may be implemented.
  • the computer system 900 includes at least one processor 904 that is connected to a bus 902, wherein the computer system 900 may be implemented using any suitable protocol, such as PCI (Peripheral Component Interconnect), PCI-Express, AGP (Accelerated Graphics Port), HyperTransport, or any other bus or point-to-point communication protocol(s).
  • the computer system 900 also includes a memory 906.
  • Control logic (software) and data are stored in the memory 906, which may take the form of random-access memory (RAM).
  • a single semiconductor platform may refer to a sole unitary semiconductor-based integrated circuit or chip. It should be noted that the term single semiconductor platform may also refer to multi-chip modules with increased connectivity which simulate on-chip operation, and make substantial improvements over utilizing a conventional central processing unit (CPU) and bus implementation. Of course, the various modules may also be situated separately or in various combinations of semiconductor platforms per the desires of the user.
  • the computer system 900 may also include a secondary storage 910.
  • the secondary storage 910 includes, for example, a hard disk drive and a removable storage drive, representing a floppy disk drive, a magnetic tape drive, a compact disk drive, digital versatile disk (DVD) drive, recording device, universal serial bus (USB) flash memory.
  • the removable storage drive at least one of reads from and writes to a removable storage unit in a well-known manner.
  • Computer programs, or computer control logic algorithms may be stored in at least one of the memory 906 and the secondary storage 910. Such computer programs, when executed, enable the computer system 900 to perform various functions as described in the foregoing.
  • the memory 906, the secondary storage 910, and any other storage are possible examples of computer-readable media.
  • the architectures and functionalities depicted in the various previous figures may be implemented in the context of the processor 904, a graphics processor coupled to a communication interface 912, an integrated circuit (not shown) that is capable of at least a portion of the capabilities of both the processor 904 and a graphics processor, a chipset (namely, a group of integrated circuits designed to work and be sold as a unit for performing related functions), and so forth.
  • the architectures and functionalities depicted in the various previously described figures may be implemented in a context of a general computer system, a circuit board system, a game console system dedicated for entertainment purposes, or an application-specific system.
  • the computer system 900 may take the form of a desktop computer, a laptop computer, a server, a workstation, a game console, an embedded system.
  • the computer system 900 may take the form of various other devices including, but not limited to a personal digital assistant (PDA) device, a mobile phone device, a smart phone, a television, and so forth. Additionally, although not shown, the computer system 900 may be coupled to a network (for example, a telecommunications network, a local area network (LAN), a wireless network, a wide area network (WAN) such as the Internet, a peer-to-peer network, a cable network, or the like) for communication purposes through an I/O interface 908.
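The splitting process described in the list above (a split on the presence of one item, chosen so that the sum of the mean squared errors of the resulting data sets is smallest) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the disclosure: the row layout, the `kpi` key, and the function names are hypothetical, and the score is taken literally as the unweighted sum of the two mean squared errors. A common alternative is to weight each data set by its size.

```python
def mse(values):
    """Mean squared error of the values around their own mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_item_split(rows, dimensions, kpi_key="kpi", min_samples_per_leaf=1):
    """Return (score, dimension, item) for the item-presence split with
    the smallest sum of mean squared errors, or None if no split is allowed.

    Assumption: each row is a dict mapping dimension names to string items,
    plus a numeric KPI value under `kpi_key` (higher means worse).
    """
    best = None
    candidates = sorted({(d, row[d]) for row in rows for d in dimensions})
    for dim, item in candidates:
        present = [r[kpi_key] for r in rows if r[dim] == item]
        absent = [r[kpi_key] for r in rows if r[dim] != item]
        # Both data sets must hold enough observations for a split to occur.
        if min(len(present), len(absent)) < min_samples_per_leaf:
            continue
        score = mse(present) + mse(absent)
        if best is None or score < best[0]:
            best = (score, dim, item)
    return best


# Hypothetical data: only software version "1.2" shows a degraded KPI.
rows = [
    {"version": "1.2", "model": "A", "kpi": 1.0},
    {"version": "1.2", "model": "B", "kpi": 1.0},
    {"version": "1.1", "model": "A", "kpi": 0.0},
    {"version": "1.1", "model": "B", "kpi": 0.0},
    {"version": "1.0", "model": "A", "kpi": 0.0},
    {"version": "1.0", "model": "B", "kpi": 0.0},
]
split = best_item_split(rows, ["version", "model"])
```

In this example the split on the presence of version "1.2" separates the degraded observations perfectly, so it is the one selected. Applying the same search recursively to each resulting data set, until a stopping criterion is met, yields the chain of items leading to a terminal node.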

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Debugging And Monitoring (AREA)

Abstract

An apparatus (100) is configured to implement a method for using a data processing arrangement (104, 200) for determining parameters that affect operating performance of a system (106) defined by at least one key performance indicator (KPI). The apparatus includes a data collecting arrangement (102) for collecting operations data from the system. The data processing arrangement is configured to categorize the operations data according to dimensions and corresponding dimension values. The data processing arrangement is configured to determine associations of one or more changes of dimension values per category of dimension with changes of the at least one key performance indicator (KPI). The data processing arrangement is configured to determine a sensitivity factor of the one or more changes with the changes of at least one key performance indicator (KPI) to determine which changes most sensitively affect the at least one key performance indicator (KPI).

Description

METHOD AND APPARATUS FOR MULTIDIMENSIONAL ROOT CAUSE ANALYSIS
TECHNICAL FIELD
The disclosure relates generally to apparatus including data processing arrangements for implementing multidimensional root cause analysis; for example, the disclosure relates to apparatus for implementing multidimensional root cause analysis for reconfiguring systems to improve their reliability and operating performance. Moreover, the disclosure relates to methods for (namely, to methods of) using aforesaid apparatus to determine parameters that affect operating performance of systems, wherein the operating performance is defined by at least one key performance indicator (KPI); for example, the disclosure relates to methods for using the determined parameters to improve reliability and operating performance of the systems.
BACKGROUND
An operating performance of a given system is an important measure of how the given system is technically functioning. In many situations, considerable effort is expended to improve operating performances of systems, for example to increase their reliability, to reduce their latency, to reduce drop-out occurrences therein, to improve their signal-to-noise ratio, and so forth. Data required to monitor performances of systems are often associated with many dimensions. Examples of multidimensional data that arise in Information and Communication Technology (ICT) operation and management include microservice interaction data, client trace data, and mobile application promote download data. Numerous dimensions of ICT operations and management scenarios are categorical, and each dimension has one or more different categories, also referred to as dimension values or items. Such a vast variety of data dimensions and their items requires strict processing and pipeline integration. In many contemporary situations, it is tedious to spot and highlight a dimension or its item, or a combination thereof, that causes a key performance indicator (KPI) of a given system to degrade. Detection of a multitude of problems in the given system depends on identifying a combination of dimensions affecting the key performance indicator (KPI). Such detection is often a more challenging task than identifying only one single influential dimension.
Contemporary known systems beneficially use a complex monitoring apparatus that uses anomaly detection to monitor one or more key performance indicators (KPI) of the systems at a high level. The apparatus is configured to perform root cause analysis only once an anomaly is detected, making assumptions that there is only one single cause triggering an alarm indicative of inadequate KPI’s. However, when performing overall system diagnosis, a root cause analysis is used to find one or more problems in a given system, given that a diagnosis is performed on a larger timeframe. In known contemporary systems, these one or more problems can co-occur and still not be correlated. Moreover, finding and solving these one or more problems are crucial in maintaining a high performance of the given system. However, to achieve that, a complete picture of all the one or more problems is essential.
Therefore, there arises a need to address the aforementioned technical problem/drawbacks in performing a multidimensional root cause analysis for improving performance of systems.
SUMMARY
It is an object of the disclosure to provide an improved method for using a data processing arrangement for determining parameters that affect operating performance of a system defined by at least one key performance indicator (KPI), and an improved apparatus that is configured to implement the aforesaid method.
This object is achieved by the features of the independent claims. Further, implementation forms are apparent from the dependent claims, the description, and the figures.
The disclosure provides a method for using a data processing arrangement for determining parameters that affect operating performance of a system defined by at least one key performance indicator (KPI) and an apparatus that is configured to implement the above method.
According to a first aspect, there is provided a method for using a data processing arrangement for determining parameters that affect operating performance of a system defined by at least one key performance indicator (KPI). The method includes collecting operations data from the system. The method includes using the data processing arrangement to categorize the operations data according to dimensions and corresponding dimension values. The method includes using the data processing arrangement to determine associations of one or more changes of dimension values per category of dimension with changes of the at least one key performance indicator (KPI). The method includes determining a sensitivity factor of the one or more changes with the changes of at least one key performance indicator (KPI) to determine which changes most sensitively affect the at least one key performance indicator (KPI).
The method detects the root cause of operating performance degradation in the system. The root cause is reflected in the data processing arrangement that is used to monitor the performance of systems. The method enhances a performance of the system by monitoring multidimensional data and finding a dimension and a dimension value manifesting highest influence on the key performance indicator (KPI). A combination of dimensions and dimension values are considered for an accurate multidimensional root cause analysis. It will be appreciated that such root cause analysis based on a combination of dimensions and dimension values is very hard for humans to perform manually. The method enables faster diagnosis than a manual human process while finding a correct root cause in a large number of items to check against the KPI’s degradation. The method can be performed for any domain that collects data as events characterized by categorical or numerical data, and has a goal of finding the item(s) correlated with a specific target feature, thereby avoiding the need for a user to set domain-specific thresholds. The method can be employed as an anomaly detection solution when run periodically to build a baseline (common unfixable problems, for example a user using a broken version).
According to a second aspect, there is provided an apparatus that is configured to implement a method of using a data processing arrangement for determining parameters that affect operating performance of a system defined by at least one key performance indicator (KPI). The apparatus includes a data collecting arrangement for collecting operations data from the system. The data processing arrangement is configured to categorize the operations data according to dimensions and corresponding dimension values. The data processing arrangement is configured to determine associations of one or more changes of dimension values per category of dimension with changes of the at least one key performance indicator (KPI). The data processing arrangement is configured to determine a sensitivity factor of the one or more changes with the changes of at least one key performance indicator (KPI) to determine which changes most sensitively affect the at least one key performance indicator (KPI).
The apparatus detects the root cause of operating performance degradation in the system. The root cause is reflected in the data processing arrangement that is used to monitor the performance of systems. The apparatus enhances a performance of the system by monitoring multidimensional data and finding a dimension and a dimension value manifesting the highest influence on the key performance indicator (KPI). A combination of dimensions and dimension values are considered for an accurate multidimensional root cause analysis. The apparatus enables faster diagnosis than a manual human process while finding a correct root cause in a large number of items to check against the KPI’s degradation. The apparatus can be used for any domain that collects data as events characterized by categorical or numerical data, and has a goal of finding the item(s) correlated with a specific target feature, thereby avoiding the need for a user to set domain-specific thresholds. The apparatus can be employed as an anomaly detection solution when run periodically to build a baseline (common unfixable problems, for example a user using a broken version).
According to a third aspect, there is provided a computer program product comprising a non-transitory computer-readable storage medium having computer-readable instructions stored thereon, the computer-readable instructions being executable by a computerized device comprising processing hardware to execute a method.
The method of the first aspect, and the apparatus of the second aspect provide a solution to the aforesaid problems of finding a relevant dimension and its item, or multiple combinations of dimensions and their items, that manifest(s) the highest influence on the behaviour of the KPI (for example temporal delay, success rate, download count, and so forth).
Therefore, in contradistinction to the prior art, the aforesaid method and the apparatus are able to determine parameters that are a root cause of operating performance degradation in the system, wherein the root cause is reflected in the data processing arrangement used to monitor the performance of systems. The method enhances the performance of the system by monitoring multidimensional data and finding a dimension and a dimension value manifesting a highest influence on a given key performance indicator (KPI). A combination of dimensions and dimension values are considered for an accurate multidimensional root cause analysis. The method enables faster diagnosis than human manual processing to be achieved, while finding a correct root cause in a large number of items to check against the KPI’s degradation. The method is susceptible to being used for any domain that collects data such as events characterized by categorical or numerical data, and has a goal of finding item(s) correlated with a specific target feature, thereby avoiding domain-specific thresholds to be set by a user. The method can be employed as an anomaly detection solution when run periodically to build a baseline (common unfixable problems, for example a user using a broken version).
These and other aspects of the disclosure will be apparent from the implementation(s) described below.
BRIEF DESCRIPTION OF DRAWINGS
Implementations of the disclosure will now be described, by way of example only, with reference to the accompanying drawings, in which:
FIG. 1 is a block diagram of an apparatus that is configured to implement a method of using a data processing arrangement for determining parameters that affect operating performance of a system defined by at least one key performance indicator (KPI) in accordance with an implementation of the disclosure;
FIG. 2 is a block diagram of a data processing arrangement in accordance with an implementation of the disclosure;
FIG. 3 is a block diagram of a multidimensional analysis module in accordance with an implementation of the disclosure;
FIG. 4 is an illustration of an exemplary view of a regression process of a decision tree fitting module in accordance with an implementation of the disclosure;
FIGS. 5A to 5B are illustrations of exemplary views of a process for determining a relevance of each leaf node using a post-processing module in accordance with an implementation of the disclosure;
FIGS. 6A to 6C are illustrations of exemplary views of a pruning process of each leaf node in accordance with an implementation of the disclosure;
FIG. 7 is an illustration of an exemplary graphical user interface that depicts a summary integration of a multidimensional root cause analysis in accordance with an implementation of the disclosure;
FIGS. 8A to 8B are flow diagrams that illustrate a method for using a data processing arrangement for determining parameters that affect operating performance of a system defined by at least one key performance indicator (KPI) in accordance with an implementation of the disclosure; and
FIG. 9 is an illustration of an exemplary apparatus, or a computer system in which the various architectures and functionalities of the various previous implementations may be implemented.
DETAILED DESCRIPTION OF THE DRAWINGS
Implementations of the disclosure provide a method for using a data processing arrangement and an apparatus that is configured to implement the method for using the data processing arrangement.
To make solutions of the disclosure more comprehensible for a person skilled in the art, the following implementations of the disclosure are described with reference to the accompanying drawings.
Terms such as "a first", "a second", "a third", and "a fourth" (if any) in the summary, claims, and the accompanying drawings of the disclosure are used to distinguish between similar objects and are not necessarily used to describe a specific sequence or order. It should be understood that the terms so used are interchangeable under appropriate circumstances, so that the implementations of the disclosure described herein are, for example, capable of being implemented in sequences other than the sequences illustrated or described herein. Furthermore, the terms "include" and "have" and any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, a method, a system, a product, or a device that includes a series of steps or units, is not necessarily limited to expressly listed steps or units but may include other steps or units that are not expressly listed or that are inherent to such process, method, product, or device.
FIG. 1 is a block diagram of an apparatus 100 that is configured to implement a method for (namely, a method of) using a data processing arrangement 104 for determining parameters that affect operating performance of a system 106 defined by at least one key performance indicator (KPI) in accordance with an implementation of the disclosure. The apparatus 100 includes a data collecting arrangement 102 for collecting operations data from the system 106. The data processing arrangement 104 is configured to categorize the operations data according to dimensions and corresponding dimension values. The data processing arrangement 104 is configured to determine associations of one or more changes of dimension values per category of dimension with changes of the at least one key performance indicator (KPI). The data processing arrangement 104 is configured to determine a sensitivity factor of the one or more changes with the changes of at least one key performance indicator (KPI) to determine which changes most sensitively affect the at least one key performance indicator (KPI).
The apparatus 100 aims at identifying the dimension, or dimension values, or a combination thereof, causing one or more key performance indicators (KPIs) to degrade. The collected data from the system 106, along with the KPI, indicate the multitude of factors affecting the operating performance of the system 106. The system 106 is diagnosed to indicate and provide a multidimensional root cause analysis to find one or more dimensions responsible for operating performance degradation on a larger time frame. The data processing arrangement 104 provides supporting data and metrics which may identify the root cause and the exact dimensions causing degradation. A chain of dimensions and dimension values is created in order to lead to a terminal node explaining the detection of the root cause and its analysis.
The apparatus 100 enables faster diagnosis than human manual processes and analysis, while finding a correct root cause in a large number of items (for example, many millions) to check against the KPI’s degradation. The apparatus 100 can be used for any domain that collects data as events characterized by categorical or numerical data and has a goal of finding the item(s) correlated with a specific target feature, thereby avoiding the need for a user to set domain-specific thresholds. The apparatus 100 can be employed as an anomaly detection solution when run periodically in order to build a baseline (common unfixable problems, for example a user using a broken version).
FIG. 2 is a block diagram of a data processing arrangement 200 in accordance with an implementation of the disclosure. The data processing arrangement 200 includes an alarm system 202, a data source 204, a multidimensional analysis module 206, a business configuration module 208, a domain-expert knowledge module 210, a multidimensional understanding module 212, and a root cause summarizer 214. The alarm system 202 monitors a key performance indicator (KPI) and triggers alarms when an overall performance of a system is unsatisfactory (for example, through anomaly detection). The alarms may provide a context needed for retrieving data associated with a timeframe and a scope of a problem. Upon the firing of each alarm, a multidimensional analysis process is triggered at the multidimensional analysis module 206. The data source 204 is a source of multi-categorical multi-dimensional data that is associated with a target value (for example, KPI). The data source 204 may be queried by the multidimensional analysis module 206 for data contextual to the alarm system 202. For example, the data source 204 may provide data for a certain microservice for the 10 minutes leading up to the triggering of the alarms, as well as data for the same period 1 day before, to be used as a reference.
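The retrieval context described above can be sketched as a pair of time windows derived from the alarm time. This is only an illustration under stated assumptions: the function name `alarm_windows` and its parameters are hypothetical, and the 10 minute lookback and 1 day reference offset are taken from the example in the preceding paragraph.

```python
from datetime import datetime, timedelta


def alarm_windows(alarm_time,
                  lookback=timedelta(minutes=10),
                  reference_offset=timedelta(days=1)):
    """Return (current, reference) windows, each a (start, end) tuple.

    The current window covers the period leading up to the alarm; the
    reference window is the same period shifted back by reference_offset.
    """
    current = (alarm_time - lookback, alarm_time)
    reference = (current[0] - reference_offset, current[1] - reference_offset)
    return current, reference


# Hypothetical alarm time, used only for illustration.
alarm = datetime(2021, 3, 18, 12, 0)
current_window, reference_window = alarm_windows(alarm)
```

Both windows may then be used to query the data source for the affected scope, for example a certain microservice.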
The business configuration module 208 may be use-case specific and is required only once when the multidimensional analysis module 206 is deployed. The business configuration module 208 includes names of dimensions to be analysed, for direct analysis by the business configuration module 208, as well as names of the dimensions used in post-processing and domain expert knowledge-based causality inference. The business configuration module 208 may include other analysis specific details, such as depth of a tree or a minimum number of samples per leaf. The minimum number of samples per leaf may be a number of observations each leaf must have or else a split cannot occur; this is used for both shortening algorithm’s recursive cycles as well as fulfilling use-case specific requirements if these are necessary. The business configuration module 208 may include an operation mode of an analysis, which can be single problem mode when used in conjunction with the alarms, or multi-problem mode when the multidimensional analysis module 206 is triggered manually and a diagnosis of a monitored system is desired.
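As an illustration only, the contents of such a business configuration may be represented as a small data structure. The class and field names below are hypothetical and are not taken from the disclosure; they merely mirror the items listed in the preceding paragraph.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class OperationMode(Enum):
    SINGLE_PROBLEM = "single"  # used in conjunction with alarms
    MULTI_PROBLEM = "multi"    # manual trigger, diagnosis of the whole system


@dataclass
class BusinessConfig:
    analysed_dimensions: List[str]                 # dimensions analysed directly
    supporting_dimensions: List[str] = field(default_factory=list)  # post-processing only
    max_depth: Optional[int] = None                # None: computed automatically
    min_samples_per_leaf: int = 1
    mode: OperationMode = OperationMode.SINGLE_PROBLEM

    def can_split(self, n_left: int, n_right: int) -> bool:
        """A split may occur only if each leaf keeps enough observations."""
        return min(n_left, n_right) >= self.min_samples_per_leaf


# Hypothetical configuration for a client-trace use case.
config = BusinessConfig(
    analysed_dimensions=["app_version", "phone_model"],
    supporting_dimensions=["timestamp", "transaction_id"],
    min_samples_per_leaf=20,
)
```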
The multidimensional analysis module 206 obtains (i) key performance indicator (KPI) data and associated multidimensional data, (ii) a business configuration, and (iii) domain expert knowledge as an input. The domain expert-knowledge module 210 is used to assess a solution in the scope of the use-case (for example, dimension dependency, dimension hierarchy, architecture composition, causal inference rules). This may be fed as an input to the multidimensional understanding module 212. The multidimensional understanding module 212 recommends a root cause based on a user’s configuration and the domain expert knowledge obtained from the domain expert-knowledge module 210 and based on an output of the multidimensional analysis module 206. The multidimensional understanding module 212 may allow a user to specify dimensions to have their statistics reported on for those leaf nodes selected by the tree as root-cause. Examples of such dimensions include dimensions that are too granular to be useful to the multidimensional analysis module 206, for example, timestamps or unique IDs of a transaction. However, reporting these dimensions once the multidimensional understanding module 212 has converged to a recommendation is useful, as the dimensions may be fused with domain expertise. The domain expertise may comprise known relationships between the dimensions, which can be transformed into rules that may report an actionable root cause. For example, a specific software application version has to affect a large proportion of users using the same version in order to be the root cause; if the root cause affects a very small percentage of users (for example, < 0.1%), then it is a “single user” problem. Similarly, the user’s country and phone model cannot be part of a combination of dimensions, as the scope of the phone model does not depend on the location of a user.
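The two example rules in the preceding paragraph can be sketched as simple checks. This is a hedged illustration: the function names, the dimension names, and the returned labels are hypothetical, and the 0.1% cut-off is the example value given above rather than a fixed parameter of the disclosure.

```python
def classify_scope(affected_users, total_users, single_user_threshold=0.001):
    """Label a candidate as a 'single user' problem when it affects less
    than the threshold share (here 0.1%) of the users of that version."""
    if total_users <= 0:
        raise ValueError("total_users must be positive")
    share = affected_users / total_users
    return "single user" if share < single_user_threshold else "candidate root cause"


# Pairs of dimensions that domain knowledge treats as independent, so they
# should not be reported together as one combination (hypothetical names).
INDEPENDENT_PAIRS = {frozenset({"user_location", "phone_model"})}


def is_valid_combination(dimensions):
    """Reject combinations that contain a pair known to be independent."""
    dims = set(dimensions)
    return not any(pair <= dims for pair in INDEPENDENT_PAIRS)
```

For instance, `classify_scope(3, 10000)` yields "single user" because 0.03% of the users are affected, whereas a version affecting 2500 of 10000 users remains a candidate root cause.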
Other types of domain knowledge organize the dimensions into categories, such as network-, client-, content-, or server-related dimensions. This may link the root cause to a location of a problem, which in turn may trigger different corrective actions. All these dimension dependencies and logical or location-based groupings, which stem from domain knowledge, are used for post-processing by the multidimensional understanding module 212 for root cause suggestion. The root cause summarizer 214 suggests a root cause based on data, configuration, and the domain expert knowledge obtained from the domain expert-knowledge module 210. The root cause summarizer 214 may suggest the root cause as consumable output by the system 200, using the items and their importance to present the order in which the problem has propagated through the data. The root cause summarizer 214 may use the data in the problematic leaves to explain how and by how much the distribution of the KPI’s value is different or has changed from the expected behaviour. The root cause summarizer 214 may explain which domain expert-rules have been met and how the decision was adjusted according to those insights. In multi-problem mode, the root cause summarizer 214 provides a summary that lists all the possible problems based on their relative severity.
FIG. 3 is a block diagram of a multidimensional analysis module 300 in accordance with an implementation of the disclosure. The multidimensional analysis module 300 includes an input obtaining module 302, a pre-processing module 304, an auto-configuration module 306, a decision tree fitting module 308, a post-processing module 310, and a decision tree-based root cause module 312. The input obtaining module 302 obtains (i) Key performance Indicator, KPI, data, and associated multidimensional data, (ii) a business configuration, and (iii) domain expert knowledge as input data. The input obtaining module 302 obtains raw multidimensional data, reference data, and configurations such as use-case specific requirements, detectable feature names, supporting feature names, and operation modes (for example, single versus multiple problems). The pre-processing module 304 is used to ensure that the data (for example, raw multidimensional data, reference data, and configurations data) is consistent in terms of input to the algorithm, such as filling missing values for certain dimensions with certain values or directly dropping them from the input data.
The auto-configuration module 306 ensures that data is consistent in terms of input to an algorithm. For example, if microservices are monitored for the success rate of fulfilled requests, then a key performance indicator (KPI) may be a Boolean variable where True means a successful request and False means a failure in fulfilling a request. In this case, a possible transformation is to map the True values to 0 and the False values to 1. Similarly, if the KPI represents a delay or latency in fulfilling those requests, then the KPI value can be used directly, as low values such as 0 represent good behaviour/performance whereas high values mean high latency in fulfilling the requests, which is typically associated with performance degradation.
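The transformation described above can be sketched as follows. This is an illustrative example only; the function name `to_badness` is hypothetical, and the only assumption beyond the text is that every KPI is brought to a common "higher is worse" orientation.

```python
from typing import Union


def to_badness(kpi_value: Union[bool, float]) -> float:
    """Map a raw KPI sample to a score where higher means worse."""
    if isinstance(kpi_value, bool):
        # Successful request (True) -> 0.0, failed request (False) -> 1.0.
        return 0.0 if kpi_value else 1.0
    # Latency-like KPIs are already oriented "higher is worse"; use directly.
    return float(kpi_value)


samples = [True, False, True, 0.25, 3.5]
print([to_badness(s) for s in samples])
```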
The auto-configuration module 306 transforms the data to suit and expedite the decision tree fitting module 308. The auto-configuration module 306 may re-compute each item of each dimension as a contribution difference to its typical or expected influence on the KPI, thereby more accurately pointing to an item or combination of items that reflect the root cause of a problem. The depth at which a regression tree may stop can be dynamically computed by the auto-configuration module 306 if it is unspecified in a configuration. For example, the maximum depth can be taken as twice the number of features, to ensure that each dimension has space to have at least two of its items singled out as problematic.
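The default depth rule can be illustrated with a short sketch. The function name `resolve_max_depth` and the dimension names in the example are hypothetical; only the "twice the number of features" rule comes from the text.

```python
from typing import Optional, Sequence


def resolve_max_depth(dimensions: Sequence[str],
                      configured: Optional[int] = None) -> int:
    """Return the configured depth, or twice the number of dimensions."""
    if configured is not None:
        return configured
    # Two levels per dimension leave room to single out two items of each.
    return 2 * len(dimensions)


print(resolve_max_depth(["app_version", "phone_model", "region"]))
```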
The decision tree fitting module 308 obtains the configuration and formatted data and fits a regression decision tree. The regression decision tree may iteratively split the data across different items, each time optimizing for an error or variance of the KPI of the ensuing data.
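A single splitting step of such a regression tree can be sketched in plain Python. This is a simplified stand-in under stated assumptions, not the disclosed implementation: each candidate split separates rows where one item (a dimension value) is present from rows where it is not, and the split with the smallest summed squared error of the KPI is chosen. The names `sse`, `best_item_split`, and the example dimensions are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, str]


def sse(values: List[float]) -> float:
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_item_split(rows: List[Row],
                    kpi: List[float]) -> Optional[Tuple[str, str, float]]:
    """Return (dimension, item, total_sse) of the best 'item present' split."""
    best = None
    for dim in sorted({d for r in rows for d in r}):
        for item in sorted({r[dim] for r in rows if dim in r}):
            present = [k for r, k in zip(rows, kpi) if r.get(dim) == item]
            absent = [k for r, k in zip(rows, kpi) if r.get(dim) != item]
            if not present or not absent:
                continue  # a split must leave data on both sides
            total = sse(present) + sse(absent)
            if best is None or total < best[2]:
                best = (dim, item, total)
    return best


rows = [
    {"app_version": "v1", "phone": "A"},
    {"app_version": "v1", "phone": "B"},
    {"app_version": "v2", "phone": "A"},
    {"app_version": "v2", "phone": "B"},
]
kpi = [0.0, 0.0, 1.0, 1.0]  # 1.0 = failed request
print(best_item_split(rows, kpi))
```

Applying the same search recursively to each side of the chosen split, until a depth limit or another stopping criterion is reached, would yield the full tree.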
The post-processing module 310 computes relevance of each leaf node by subtracting a minimum of the KPI’s average across all leaf nodes from a maximum of the KPI’s average, and this yields a range of where the KPI’s average lies. This range is divided by two and this results in a relevance threshold. Leaf nodes above the relevance threshold are considered relevant candidates of being a root cause.
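The threshold computation can be written out as a short sketch. It follows the description literally (half of the range between the largest and smallest leaf averages); the function names and the leaf labels are hypothetical.

```python
from typing import Dict, List


def relevance_threshold(leaf_kpi_avg: Dict[str, float]) -> float:
    """Half of the range spanned by the KPI averages of all leaf nodes."""
    values = list(leaf_kpi_avg.values())
    return (max(values) - min(values)) / 2


def relevant_leaves(leaf_kpi_avg: Dict[str, float]) -> List[str]:
    """Leaf nodes whose KPI average lies above the relevance threshold."""
    threshold = relevance_threshold(leaf_kpi_avg)
    return [leaf for leaf, avg in leaf_kpi_avg.items() if avg > threshold]


leaves = {"A": 0.0, "B": 0.2, "C": 0.9, "D": 0.6}
print(relevance_threshold(leaves), relevant_leaves(leaves))
```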
The decision tree-based root cause module 312 provides one or more problems associated with degradation of the KPI based on the configuration, such as the top 1 or top N problems. If the top 1 problem mode is selected, then a single leaf node is selected and presented, together with the path from the root node to that leaf node, as the root cause. This can be achieved by selecting the leaf node with the highest KPI average value, or the highest KPI average value weighted according to the number of samples in that node relative to the whole dataset.
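Both selection criteria for the top 1 mode can be sketched as follows. The `Leaf` structure, the function name `top_problem`, and the example paths are hypothetical; the weighting shown (KPI average multiplied by the leaf's share of all samples) is one plausible reading of "weighted according to a number of samples" and is an assumption.

```python
from typing import List, NamedTuple


class Leaf(NamedTuple):
    path: List[str]   # items on the path from the root node to this leaf
    kpi_avg: float    # average KPI of the samples in this leaf
    n_samples: int    # number of samples in this leaf


def top_problem(leaves: List[Leaf], weighted: bool = False) -> Leaf:
    """Pick the leaf with the highest (optionally sample-weighted) KPI average."""
    total = sum(leaf.n_samples for leaf in leaves)

    def score(leaf: Leaf) -> float:
        if weighted:
            return leaf.kpi_avg * leaf.n_samples / total
        return leaf.kpi_avg

    return max(leaves, key=score)


candidates = [
    Leaf(["app_version=v2", "phone=A"], 0.9, 10),
    Leaf(["region=EU"], 0.6, 400),
    Leaf(["app_version=v1"], 0.05, 590),
]
print(top_problem(candidates).path)
print(top_problem(candidates, weighted=True).path)
```

The unweighted criterion favours a small but severely degraded leaf, whereas the weighted criterion favours a moderately degraded leaf that covers many samples.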
If the top N problems mode is selected, then all relevant leaves are selected, their paths are aggregated, and a summary is proposed. This may be an indication of various problems across a monitored system and can be used as an overall system health check across a longer period of time. For the top N problems mode, redundancy may be directly influenced by the relevance threshold, and this can be used as a sensitivity setting.
FIG. 4 illustrates an exemplary view of a regression process of a decision tree fitting module 400 procedure in accordance with an implementation of the disclosure. The decision tree fitting module 400 may perform an iterative process 402 to split data across different items into leaf nodes 406A and 406B. Each item is optimised for an error or variance of the KPI of the ensuing data. In other words, the decision tree fitting module 400 determines a set of parameters ("branches") that are broken down into sub-sets ("leaves" or leaf nodes) to determine which parameters in sub-sets (namely, leaf nodes 406A and 406B) are most sensitively affecting the KPI's. The decision tree fitting module 400 includes two types of nodes: a decision node 404 showing the item leading to a split and terminal or leaf nodes where the iterative process 402 stops. The variance reduction between the decision node 404 and the leaf nodes 406A-B may be used to determine which changes most sensitively affect the at least one key performance indicator (KPI). The item has minimum variance between the leaf node (406A and 406B) across the different items in a dataset. The path of decision from the decision node 404 (namely, a root node) to each of the leaf nodes (406A-F) is a list of items on which the decision was made for the split. The leaf nodes (406A-F) provide insight to whether the key performance indicator (KPI) has been impacted negatively or not by a specific leaf node.
FIGS. 5A and 5B are illustrations of exemplary views of determining a relevance of each leaf node using a post-processing module in accordance with an implementation of the disclosure. The exemplary views include a decision node 502A showing the item leading to a split and leaf nodes 502B-S for each of which the relevance is determined. The relevance of each leaf node 502B-S is determined by subtracting a minimum average of the key performance indicator (KPI) across all leaf nodes (namely, circled leaf nodes 502F, 502G, 502I, 502J, 502K, 502M, 502O, 502Q, 502S) from the maximum average of the key performance indicator (KPI) across all the leaf nodes, which yields a range of the key performance indicator's (KPI) average. The range of the key performance indicator's (KPI) average is divided by two to obtain a relevance threshold. The relevance threshold is determined using the following formula: relevance threshold = (max(KPI average of all leaf nodes) - min(KPI average of all leaf nodes)) / 2
The leaf nodes 502B-S that are above the relevance threshold are considered relevant candidates (for example, 502F, 502G, 502J) for being a root cause or affecting the key performance indicator (KPI) of a system.
With reference to FIGS. 5A and 5B, FIGS. 6A and 6B illustrate exemplary views of a pruning process of each leaf node in accordance with an implementation of the disclosure. The exemplary views show the decision node 502A showing the item leading to a split and the leaf nodes 502B-S on which the pruning process has occurred. The pruning process as shown in FIGS. 6A and 6B depicts that two or more leaf nodes 502B-S with the same relevance provide no extra information, hence these leaf nodes 502B-S may be pruned in order to shorten the tree, shortening the paths leading to the leaf nodes 502B-S. Pruning of the leaf nodes 502B-S may remove items with redundant splits. The leaf nodes 502B-S with the same relevance level are removed and their data is aggregated back to their corresponding parent node, which may become a leaf node, and the split that occurred at that parent node is hence removed. This pruning process is repeated until all redundancy has been removed. The remaining tree may include a smaller number of leaf nodes and shorter decision paths. The tree is pruned in order to ensure that the decision paths to the leaf nodes 502B-S do not include decisions with a small gain in information. As such, if two leaf nodes of a same parent node have the same relevance level, then the split of those two leaf nodes that occurred at the parent node is considered redundant, and the split is reverted by pruning those two leaf nodes. For example, if neither leaf node met the cut-off threshold, then both leaf nodes are highly unlikely to be the root cause. If both leaf nodes met the cut-off threshold, then the split offers only a small gain of information. In other words, only the splits where one of the ensuing leaf nodes has met the cut-off threshold and the other one has not are retained. In that case, the chosen item splits the data into two leaf nodes, one of which is highly associated with the KPI's degradation and the other of which is not.
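The pruning rule can be sketched on a small binary tree. This is a minimal illustration under stated assumptions: every internal node has exactly two children, each node already stores the aggregated KPI sum and sample count of its subtree, and the "relevance level" of a leaf is simply whether its KPI average exceeds the threshold. The `Node` class and `prune` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    item: str                      # item that led to this node
    kpi_sum: float                 # aggregated KPI of all samples below
    n: int                         # number of samples below
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    @property
    def is_leaf(self) -> bool:
        return self.left is None and self.right is None

    @property
    def kpi_avg(self) -> float:
        return self.kpi_sum / self.n


def prune(node: Node, threshold: float) -> Node:
    """Revert splits whose two leaf children share the same relevance level."""
    if node.is_leaf:
        return node
    # Post-order traversal: prune the children first, so that a parent whose
    # children have just become leaves is re-examined on the way back up.
    prune(node.left, threshold)
    prune(node.right, threshold)
    if node.left.is_leaf and node.right.is_leaf:
        same = (node.left.kpi_avg > threshold) == (node.right.kpi_avg > threshold)
        if same:
            node.left = node.right = None  # parent becomes a leaf
    return node


root = Node("all", 35.0, 100,
            left=Node("app_version=v2", 32.0, 40,
                      left=Node("phone=A", 17.0, 20),
                      right=Node("phone!=A", 15.0, 20)),
            right=Node("app_version!=v2", 3.0, 60))
prune(root, threshold=0.4)
print(root.left.is_leaf, root.is_leaf)
```

In the example, the split on `phone=A` is reverted because both of its leaves exceed the threshold, while the split on `app_version=v2` is kept because only one side does.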
In FIG. 6C, once the tree is pruned, the next step is to select from the relevant leaf nodes the one containing the top problem (according to some selection criterion, for example, highest KPI average) or to select top-N problems or all problems that are identified by the remaining relevant leaf nodes, depending on the use-case. FIG. 7 is an illustration of an exemplary graphical user interface that depicts a summary integration of a multidimensional root cause analysis in accordance with an implementation of the disclosure. The graphical user interface depicts an alert list of incidents and their multidimensional root cause analysis in a table 702. The graphical user interface includes a trace graph analytics field 704 and a trace log multidimensional field 706. The trace graph analytics field 704 shows a problem for a selected alarm and the trace log multidimensional field 706 shows the details of a single alert.
FIGS. 8A to 8C are flow diagrams that illustrate a method for using a data processing arrangement for determining parameters that affect operating performance of a system defined by at least one key performance indicator (KPI) in accordance with an implementation of the disclosure. At a step 802, operations data are collected from the system. At a step 804, the operations data are categorized according to dimensions and corresponding dimension values using the data processing arrangement. At a step 806, associations of one or more changes of dimension values per category of dimension with changes of the at least one key performance indicator (KPI) are determined using the data processing arrangement. At a step 808, a sensitivity factor of the one or more changes with the changes of at least one key performance indicator (KPI) is determined to determine which changes most sensitively affect the at least one key performance indicator (KPI).
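Steps 802 to 808 can be illustrated with a deliberately simple sketch. The text does not define how the sensitivity factor is computed, so the example uses a hypothetical stand-in: the change in mean KPI of each (dimension, value) item between a reference period and the current period. All names and the sample events are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Event = Tuple[Dict[str, str], float]   # (dimension values, KPI sample)
Item = Tuple[str, str]                 # (dimension, dimension value)


def mean_kpi_by_item(events: List[Event]) -> Dict[Item, float]:
    """Step 804: categorize events by dimension and value, then average the KPI."""
    sums = defaultdict(lambda: [0.0, 0])
    for dims, kpi in events:
        for dim, value in dims.items():
            acc = sums[(dim, value)]
            acc[0] += kpi
            acc[1] += 1
    return {item: total / count for item, (total, count) in sums.items()}


def rank_items(current: List[Event],
               reference: List[Event]) -> List[Tuple[Item, float]]:
    """Steps 806-808: rank items by KPI change against the reference period."""
    cur = mean_kpi_by_item(current)
    ref = mean_kpi_by_item(reference)
    # Stand-in sensitivity factor: difference of mean KPI per item.
    deltas = {item: cur[item] - ref.get(item, 0.0) for item in cur}
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)


reference = [
    ({"app_version": "v1", "phone": "A"}, 0.0),
    ({"app_version": "v2", "phone": "B"}, 0.0),
]
current = [
    ({"app_version": "v1", "phone": "A"}, 0.0),
    ({"app_version": "v1", "phone": "B"}, 0.0),
    ({"app_version": "v2", "phone": "A"}, 1.0),
    ({"app_version": "v2", "phone": "B"}, 1.0),
]
print(rank_items(current, reference)[0])
```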
The method detects the root cause of operating performance degradation in the system. The root cause is reflected in the data processing arrangement that is used to monitor the performance of systems. The method enhances the performance of the system by monitoring multidimensional data and finding the dimension and dimension value manifesting the highest influence on the key performance indicator (KPI). A combination of dimensions and dimension values is considered for an accurate multidimensional root cause analysis. The method enables faster diagnosis than is feasible using manual processing while finding a correct root cause among a large number of items (for example, many millions) to check against the KPI's degradation. The method can be performed for any domain that collects data as events characterized by categorical or numerical data and has a goal of finding the item(s) correlated with a specific target feature, thereby avoiding the need for a user to set domain-specific thresholds. The method can be employed as an Anomaly Detection solution when run periodically to build a baseline (common unfixable problems, for example a user using a broken version).
The method determines particular parameters (for example, sensitivity factor) that strongly influence particular KPIs. For example, user numbers may correlate strongly with latency as a KPI. Alternatively, certain models of smart phones may be more prone to drop-outs, where the drop-out rate is a KPI. Some smart phones may have a version of software that is strongly correlated with poor sound quality, where the sound quality is a KPI. Some models of smart phones with a certain version of software (namely, a combination of two factors) may be prone to latching up (namely, software crash), where the operating reliability is a KPI.
Optionally, the sensitivity factor of the one or more changes is determined over one or more time periods of mutually different durations to reduce an effect of stochastic noise in the operations data when determining the sensitivity factor. One of the time periods may be a reference time period.
Optionally, the method includes performing steps (iii) and (iv) iteratively until the dimensions and corresponding dimensional values that result in a maximum value of the sensitivity factor of the one or more changes with the changes of at least one key performance indicator (KPI) are determined. Optionally, the method includes determining a change in the sensitivity factor resulting from an interactive combination of changes in the dimension values of the dimensions. The method may include determining one or more binary state changes in the dimension values.
Optionally, the method includes determining from the changes that most sensitively affect the at least one key performance indicator (KPI) one or more technical problems affecting the system. Optionally, the method includes using in (ii) a regression decision tree for implementing iteratively a splitting process to split the dimension values according to pre-defined criteria into one or more data sets to determine an influence of the dimension values on the at least one key performance indicator (KPI).
Optionally, the method includes repetitively implementing the splitting process such that a split is found that creates a smallest sum of mean squared errors in the one or more data sets of the dimension values, until a stopping criterion is met. In the regression decision tree, each decisional split is taken across the data where one certain item is present. The item is selected according to the difference between the negative influence that the item has on the KPI and that of the remaining data where the item is not present. The chain of items leading to a terminal node explains how the root cause was detected, as well as the importance of each item in that chain. Results are expressed together with the supporting data and the metrics computed to reach the conclusion. The method does not require a threshold setting. The metric used in determining which item to split by is relative, as it is computed based on the KPI's value associated with a particular item and then compared item by item, and the highest contribution wins.
Optionally, the method further includes determining from the sensitivity factor and its association with the at least one key performance indicator (KPI) one or more problems in operation of the system causing a degradation in performance of the system. Optionally, the decision tree may split the dimension to lead a chain of items, further leading to a terminal node, detecting and explaining the root cause for operating performance degradation.
A computer program product including a non-transitory computer-readable storage medium having computer-readable instructions stored thereon, the computer-readable instructions being executable by a computerized device including processing hardware to execute a method.
FIG. 9 is an illustration of an exemplary apparatus, or a computer system 900, in which the various architectures and functionalities of the various previous implementations may be implemented. As shown, the computer system 900 includes at least one processor 904 that is connected to a bus 902, wherein the computer system 900 may be implemented using any suitable protocol, such as PCI (Peripheral Component Interconnect), PCI-Express, AGP (Accelerated Graphics Port), HyperTransport, or any other bus or point-to-point communication protocol(s). The computer system 900 also includes a memory 906.
Control logic (software) and data are stored in the memory 906, which may take the form of random-access memory (RAM). In the disclosure, a single semiconductor platform may refer to a sole unitary semiconductor-based integrated circuit or chip. It should be noted that the term single semiconductor platform may also refer to multi-chip modules with increased connectivity which simulate on-chip operation, and make substantial improvements over utilizing a conventional central processing unit (CPU) and bus implementation. Of course, the various modules may also be situated separately or in various combinations of semiconductor platforms per the desires of the user.
The computer system 900 may also include a secondary storage 910. The secondary storage 910 includes, for example, a hard disk drive and a removable storage drive, representing a floppy disk drive, a magnetic tape drive, a compact disk drive, digital versatile disk (DVD) drive, recording device, universal serial bus (USB) flash memory. The removable storage drive at least one of reads from and writes to a removable storage unit in a well-known manner.
Computer programs, or computer control logic algorithms, may be stored in at least one of the memory 906 and the secondary storage 910. Such computer programs, when executed, enable the computer system 900 to perform various functions as described in the foregoing. The memory 906, the secondary storage 910, and any other storage are possible examples of computer-readable media.
In an implementation, the architectures and functionalities depicted in the various previous figures may be implemented in the context of the processor 904, a graphics processor coupled to a communication interface 912, an integrated circuit (not shown) that is capable of at least a portion of the capabilities of both the processor 904 and a graphics processor, a chipset (namely, a group of integrated circuits designed to work and sold as a unit for performing related functions), and so forth.
Furthermore, the architectures and functionalities depicted in the various previously described figures may be implemented in the context of a general computer system, a circuit board system, a game console system dedicated for entertainment purposes, or an application-specific system. For example, the computer system 900 may take the form of a desktop computer, a laptop computer, a server, a workstation, a game console, or an embedded system.
Furthermore, the computer system 900 may take the form of various other devices including, but not limited to a personal digital assistant (PDA) device, a mobile phone device, a smart phone, a television, and so forth. Additionally, although not shown, the computer system 900 may be coupled to a network (for example, a telecommunications network, a local area network (LAN), a wireless network, a wide area network (WAN) such as the Internet, a peer-to-peer network, a cable network, or the like) for communication purposes through an I/O interface 908.
It should be understood that the arrangement of components illustrated in the figures described is exemplary and that other arrangements may be possible. It should also be understood that the various system components (and means) defined by the claims, described below, and illustrated in the various block diagrams represent components in some systems configured according to the subject matter disclosed herein. For example, one or more of these system components (and means) may be realized, in whole or in part, by at least some of the components illustrated in the arrangements illustrated in the described figures.
In addition, while at least one of these components is implemented at least partially as an electronic hardware component, and therefore constitutes a machine, the other components may be implemented in software that when included in an execution environment constitutes a machine, hardware, or a combination of software and hardware.
Although the disclosure and its advantages have been described in detail, it should be understood that various changes, substitutions, and alterations can be made herein without departing from the scope of the disclosure as defined by the appended claims.

Claims

1. A method for using a data processing arrangement (104, 200) for determining parameters that affect operating performance of a system (106) defined by at least one key performance indicator (KPI), wherein the method includes: (i) collecting operations data from the system (106);
(ii) using the data processing arrangement (104, 200) to categorize the operations data according to dimensions and corresponding dimension values;
(iii) using the data processing arrangement (104, 200) to determine associations of one or more changes of dimension values per category of dimension with changes of the at least one key performance indicator (KPI); and
(iv) determining a sensitivity factor of the one or more changes with the changes of at least one key performance indicator (KPI) to determine which changes most sensitively affect the at least one key performance indicator (KPI).
2. The method of claim 1, wherein the sensitivity factor of the one or more changes is determined over a plurality of time periods of mutually different durations to reduce an effect of stochastic noise in the operations data when determining the sensitivity factor.
3. The method of claim 2, wherein one of the time periods is a reference time period.
4. The method of claim 1 or 2, wherein the method includes performing steps (iii) and (iv) iteratively until the dimensions and corresponding dimensional values that result in a maximum value of the sensitivity factor of the one or more changes with the changes of at least one key performance indicator (KPI) are determined.
5. The method of claim 1, 2, 3 or 4, wherein the method includes determining a change in the sensitivity factor resulting from an interactive combination of changes in the dimension values of the dimensions.
6. The method of any one of claims 1 to 5, wherein the method includes determining one or more binary state changes in the dimension values.
7. The method of any one of claims 1 to 6, wherein the method includes determining from the changes that most sensitively affect the at least one key performance indicator (KPI) one or more technical problems affecting the system (106).
8. The method of any one of the preceding claims, wherein the method includes using in (ii) a regression decision tree for implementing iteratively a splitting process to split the dimension values according to pre-defined criteria into a plurality of data sets to determine an influence of the dimension values on the at least one key performance indicator (KPI).
9. The method of claim 8, wherein the method includes repetitively implementing the splitting process such that a split is found that creates a smallest sum of mean squared errors in the plurality of data sets of the dimension values, until a stopping criterion is met.
10. The method of any one of claims 1 to 9, wherein the method further includes determining from the sensitivity factor and its association with the at least one key performance indicator (KPI) one or more problems in operation of the system (106) causing a degradation in performance of the system (106).
11. An apparatus (100) that is configured to implement a method for using a data processing arrangement (104, 200) for determining parameters that affect operating performance of a system (106) defined by at least one key performance indicator (KPI), wherein:
(i) the apparatus (100) includes a data collecting arrangement (102) for collecting operations data from the system (106);
(ii) the data processing arrangement (104, 200) is configured to categorize the operations data according to dimensions and corresponding dimension values;
(iii) the data processing arrangement (104, 200) is configured to determine associations of one or more changes of dimension values per category of dimension with changes of the at least one key performance indicator (KPI); and
(iv) the data processing arrangement (104, 200) is configured to determine a sensitivity factor of the one or more changes with the changes of at least one key performance indicator (KPI) to determine which changes most sensitively affect the at least one key performance indicator (KPI).
12. A computer program product comprising a non-transitory computer-readable storage medium having computer-readable instructions stored thereon, the computer-readable instructions being executable by a computerized device comprising processing hardware to execute a method of any one of claims 1 to 10.
PCT/EP2021/057045 2021-03-19 2021-03-19 Method and apparatus for multidimensional root cause analysis WO2022194385A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/EP2021/057045 WO2022194385A1 (en) 2021-03-19 2021-03-19 Method and apparatus for multidimensional root cause analysis

Publications (1)

Publication Number Publication Date
WO2022194385A1 (en) 2022-09-22

Family

ID=75267467

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2021/057045 WO2022194385A1 (en) 2021-03-19 2021-03-19 Method and apparatus for multidimensional root cause analysis

Country Status (1)

Country Link
WO (1) WO2022194385A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170364819A1 (en) * 2016-06-17 2017-12-21 Futurewei Technologies, Inc. Root cause analysis in a communication network via probabilistic network structure
EP3379357A1 (en) * 2017-03-24 2018-09-26 ABB Schweiz AG Computer system and method for monitoring the technical state of industrial process systems
US20210058306A1 (en) * 2019-08-23 2021-02-25 Cisco Technology, Inc. Application performance management integration with network assurance

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YANG RENYU ET AL: "Intelligent Resource Scheduling at Scale: A Machine Learning Perspective", 2018 IEEE SYMPOSIUM ON SERVICE-ORIENTED SYSTEM ENGINEERING (SOSE), IEEE, 26 March 2018 (2018-03-26), pages 132 - 141, XP033340472, DOI: 10.1109/SOSE.2018.00025 *

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21714827

Country of ref document: EP

Kind code of ref document: A1