CN115964211A

CN115964211A - Root cause positioning method, device, equipment and readable medium

Info

Publication number: CN115964211A
Application number: CN202211698053.6A
Authority: CN
Inventors: 陈超宇; 余航; 雷志超; 李建国
Original assignee: Alipay Hangzhou Information Technology Co Ltd
Current assignee: Alipay Hangzhou Information Technology Co Ltd
Priority date: 2022-12-28
Filing date: 2022-12-28
Publication date: 2023-04-14
Also published as: WO2024139255A1

Abstract

The embodiment of the specification discloses a root cause positioning method, apparatus, device and readable medium. The scheme may include: first, for an abnormal target index, training a linear regression model using the time sequence data corresponding to the target index and the time sequence data corresponding to the candidate indexes associated with the target index to obtain model regression coefficients, and screening possible root cause indexes out of those candidate indexes according to the model regression coefficients; then, using an attribution analysis method to calculate an attribution score for each possible root cause index, so that the root cause index can be determined from the possible root cause indexes according to the attribution scores. The root cause positioning method provided by the embodiment of the specification is high in accuracy and strong in interpretability.

Description

Root cause positioning method, device, equipment and readable medium

Technical Field

The present application relates to the field of computer technologies, and in particular, to a root cause positioning method, apparatus, device, and computer readable medium.

Background

With the development of information technology, huge data analysis and processing systems, such as distributed data systems, are now widely used. If a failure in such a system cannot be resolved quickly, a large number of services cannot run normally, which seriously affects users and causes huge economic losses. Rapid and accurate fault diagnosis and recovery is therefore needed before a fault affects service. The central task of fault diagnosis and recovery is Root Cause Analysis (RCA): when an abnormality occurs, locating which indexes caused it, so as to analyze the root cause of the system or service failure.

Accurate root cause positioning for each service or system abnormality is an important guarantee of the product quality of the system.

Disclosure of Invention

The embodiment of the specification provides a root cause positioning method, a root cause positioning device, equipment and a computer readable medium, which are used for improving the accuracy of the existing root cause positioning method.

In order to solve the above technical problem, the embodiments of the present specification are implemented as follows:

an embodiment of the present specification provides a root cause positioning method, including:

acquiring first time sequence data corresponding to a target index and second time sequence data corresponding to each candidate index in a first candidate index set associated with the target index in an abnormal time period; the target index is used for reflecting occurrence of system abnormity or business abnormity; the candidate index is used for reflecting the reason of the occurrence of the abnormality;

training a linear regression model based on the first time sequence data and the second time sequence data to obtain a model regression coefficient;

according to the model regression coefficient, determining candidate indexes corresponding to the model regression coefficient meeting a first preset condition from the first candidate index set as belonging to a second candidate index set;

calculating an attribution score of each candidate index in the second candidate index set; the attribution score is used for reflecting the proportion of the change of the target index contributed by the change of the candidate index in the total change of the target index in an abnormal period;

and according to the attribution scores, determining candidate indexes corresponding to attribution scores meeting a second preset condition from the second candidate index set as belonging to the root index set.

An embodiment of this specification provides a root cause positioning apparatus, including:

the data acquisition module is used for acquiring first time sequence data corresponding to a target index and second time sequence data corresponding to each candidate index in a first candidate index set associated with the target index in an abnormal time period; the target index is used for reflecting system abnormity or business abnormity; the candidate index is used for reflecting the reason of the occurrence of the abnormality;

the model coefficient determining module is used for training a linear regression model based on the first time sequence data and the second time sequence data to obtain a model regression coefficient;

the first determining module is used for determining candidate indexes corresponding to the model regression coefficient meeting a first preset condition from the first candidate index set according to the model regression coefficient to belong to a second candidate index set;

an attribution score calculating module for calculating an attribution score of each candidate index in the second candidate index set; the attribution score is used for reflecting the proportion of the variation of the target index contributed by the variation of the candidate index in the total variation of the target index in an abnormal period;

and the second determining module is used for determining the candidate indexes corresponding to the attribution scores meeting a second preset condition from the second candidate index set as belonging to the root factor index set according to the attribution scores.

An embodiment of this specification provides a root cause positioning device, including:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor to enable the at least one processor to:

acquiring first time sequence data corresponding to a target index and second time sequence data corresponding to each candidate index in a first candidate index set associated with the target index in an abnormal time period; the target index is used for reflecting occurrence of system abnormity or business abnormity; the candidate index is used for reflecting the reason of the abnormality;

according to the model regression coefficients, determining candidate indexes corresponding to the model regression coefficients meeting a first preset condition from the first candidate index set as candidate indexes belonging to a second candidate index set;

calculating an attribution score of each candidate index in the second candidate index set; the attribution score is used for reflecting the proportion of the variation of the target index contributed by the variation of the candidate index in the total variation of the target index in an abnormal period;

Embodiments of the present specification provide a computer readable medium having computer readable instructions stored thereon, the computer readable instructions being executable by a processor to implement a root cause positioning method.

One embodiment of the present description can achieve at least the following advantages: candidate root indexes which possibly cause target index abnormity are preliminarily screened out through a linear regression model, and the attribution score of each candidate root index is calculated based on the influence of the change of the candidate root index on the change of the target index, so that the root index is determined according to the attribution score.

Drawings

In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments described in the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without any creative effort.

Fig. 1 is a schematic flowchart of a root cause location method provided in an embodiment of the present disclosure;

Fig. 2 is a schematic diagram of an actual application scenario of the root cause location method provided in the embodiment of the present specification;

Fig. 3 is a schematic structural diagram of an exemplary embodiment of a root cause positioning device corresponding to Fig. 1;

Fig. 4 is a schematic structural diagram of a root cause locating apparatus corresponding to Fig. 1 provided in an embodiment of the present specification.

Detailed Description

To make the objects, technical solutions and advantages of one or more embodiments of the present disclosure more apparent, the technical solutions of one or more embodiments of the present disclosure will be described in detail and completely with reference to the specific embodiments of the present disclosure and the accompanying drawings. It is to be understood that the embodiments described are only a few embodiments of the present disclosure, and not all embodiments. All other embodiments that can be derived by a person skilled in the art from the embodiments given herein without making any creative effort fall within the protection scope of one or more embodiments of the present disclosure.

It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, these information should not be limited to these terms. These terms are only used to distinguish one type of information from another.

The technical solutions provided by the embodiments of the present description are described in detail below with reference to the accompanying drawings.

System failures and events can have a tremendous impact on the modern Information Technology (IT) and distributed data systems widely adopted by financial companies, as they can lead to system crashes, cause significant financial loss, and compromise customer trust. Given the huge service volumes and user demands placed on a distributed system, responding quickly to each service alarm and completing root cause detection and troubleshooting of the underlying candidate indexes is a necessary and important guarantee of platform quality. Root cause location, as the bridge connecting anomaly detection and fault recovery, therefore plays an important role in the operation and maintenance of current distributed systems.

In the prior art, root cause location is usually modeled as a multidimensional root cause location problem or a graph-based root cause location problem.

In the multidimensional root cause positioning scheme, an abnormality in a Key Performance Index (KPI) is explained by identifying the combination of abnormal values of the multidimensional attributes corresponding to that KPI. Multidimensional root cause positioning relies on two assumptions: 1) the value of a KPI in each dimension is equal to the sum of the values of its attributes; 2) all KPIs and their attributes can be monitored. In practice, however, these two assumptions are too strict, which limits the application scenarios and the accuracy of such methods.

The graph-based root cause location scheme first constructs a causal graph, either by tracing service calls or with a causal discovery algorithm, and then finds the root cause node through rule-based traversal or random walk. A major obstacle to applying trace graphs and rule-based traversal is that they are intrusive to the system and that enumerating all traces and rules typically requires a large amount of work. Learning the graph structure with causal discovery methods has high computational and sample complexity: for large graphs in particular the computation can be very slow, and the results can be inaccurate when the number of observations of the metrics in the graph is small. Once the graph is obtained, a random walk may fail to converge to the root cause if the number of walks is not large enough.

In order to overcome these defects in the prior art, the embodiments of the present specification propose an innovative root cause location framework. Its idea is derived from the field of explainable artificial intelligence (also called interpretable AI, Explainable AI, or XAI for short): the root cause location problem in the field of intelligent operations and maintenance (AIOps) is modeled as an attribution analysis problem, and through attribution analysis the framework can attribute the abnormality of a target key index to a few candidate indexes.

Firstly, for abnormal target key indexes, training a linear regression model by using time sequence data corresponding to the target key indexes and time sequence data corresponding to candidate root factor indexes associated with the target key indexes to obtain model regression coefficients, so as to screen out possible root factor indexes from the candidate root factor indexes associated with the target key indexes according to the model regression coefficients; then, the attribution analysis method is adopted to calculate the attribution score of each possible root index, so that the root index can be determined from the possible root indexes according to the attribution scores. The root cause positioning method according to the embodiment of the specification has high efficiency and accuracy and strong interpretability.

Next, a root cause positioning method provided in the embodiments of the specification will be specifically described with reference to the accompanying drawings:

Fig. 1 is a schematic flow chart of a root cause positioning method provided in an embodiment of the present specification. It is to be appreciated that the method can be performed by any apparatus, device, platform, or cluster of devices with computing and processing capability.

As shown in fig. 1, the process may include the following steps:

Step 102: acquire first time sequence data corresponding to a target index, and second time sequence data corresponding to each candidate index in a first candidate index set associated with the target index, in an abnormal time period.

During system operation or service execution, an abnormality or a failure may occur. The target index can be used for reflecting the occurrence of system abnormality or service abnormality. The candidate index, also called candidate root cause index, can be used to reflect the cause of the abnormality. In actual application, the mapping relationship between the target index and the candidate index is known. In general, one target metric may be associated with more than one candidate metric. The solution of the embodiment of the present specification may be configured to determine, from the more than one candidate index, a root cause index that causes an abnormality in the target index.

In the database service fault location scenario, the target index may be a tenant key index reflecting the performance of the tenant, for example SQL_SELECT_RT (tenant SQL query time), LOGIC_READS (tenant query logical read number), SQL_QUERY_TIME (tenant SQL queuing time), RPC_PACKAGE_IN/OUT (tenant communication input number/output number), and the like, but is not limited thereto. Accordingly, the candidate index may be an SQL metric reflecting the performance of each SQL, for example cpu_time (CPU time of a single SQL), lr (local_reads) (logical read number of a single SQL), queue_time (queue time of a single SQL), rpc_count (communication number of a single SQL), and the like, but is not limited thereto.

In practical applications, taking the tenant key index SQL_SELECT_RT (tenant SQL query time) as an example, the target index can be considered to be associated with the cpu_time metric (CPU time of a single SQL) of all SQLs. Taking the tenant key index LOGIC_READS as an example, the target index can be considered to be associated with the lr metric (local_reads, logical read number of a single SQL) of all SQLs.

In a container fault location scenario, the target indicator may be the number of failed calls occurring in the container; accordingly, the candidate metrics may be container metrics, such as, but not limited to, CPU usage, memory usage, ingress and egress traffic, TCP connection number, and the like.

In actual application, it can be considered that the target index of the number of failed calls occurring in the container is associated with the container metrics such as CPU usage, memory usage, ingress and egress traffic, and TCP connection number.

In the embodiment of the present specification, whether a target index has a fault or is abnormal may be monitored in real time through an online monitoring platform. For each system fault or service abnormality, an abnormal time period including a fault or abnormal time can be determined, and data corresponding to a target index and data corresponding to a candidate index associated with the target index in the abnormal time period are obtained.

In practical application, the acquisition step length of the data corresponding to the target index and the acquisition step length of the data corresponding to the candidate index may be different. For example, in a scenario of positioning a problematic SQL program, the online monitoring platform may acquire tenant key index data aggregated by minutes in an abnormal period, and may acquire sporadically aggregated SQL metric data in the abnormal period. In order to facilitate subsequent model training, the tenant key index data and the SQL measurement data may be aggregated in advance according to a preset time granularity, for example, the tenant key index data and the SQL measurement data are aggregated according to a granularity of 1 minute.

Therefore, step 102 may specifically include: firstly, acquiring target index data corresponding to a target index and candidate index data corresponding to each candidate index in a first candidate index set associated with the target index in an abnormal time period; and then, respectively aggregating the target index data and the candidate index data according to the same preset time granularity to obtain first time sequence data corresponding to the target index data and second time sequence data corresponding to the candidate index data.

In addition, in practical application, a large amount of data loss may exist in both the data corresponding to the target index and the data corresponding to the candidate index, so that after the data are obtained and aggregated respectively according to the same preset time granularity, the missing values can be filled by using adjacent mean values to obtain first time sequence data corresponding to the target index data and second time sequence data corresponding to the candidate index data, thereby ensuring subsequent calculation.
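The aggregation and filling described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the implementation of the embodiment: the function names, the use of integer second timestamps, averaging within each bucket, and the choice of the nearest observed value on each side as the "adjacent" values are all assumptions made for the example.

```python
from collections import defaultdict


def aggregate_by_minute(samples, start, end):
    """Average (timestamp_seconds, value) samples into 60-second buckets.

    Buckets cover [start, end); a bucket with no samples is returned as None.
    Averaging within a bucket is an assumption; a sum may suit some metrics.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        if start <= ts < end:
            buckets[(ts - start) // 60].append(value)
    n_buckets = (end - start + 59) // 60
    return [
        sum(buckets[i]) / len(buckets[i]) if i in buckets else None
        for i in range(n_buckets)
    ]


def fill_with_adjacent_mean(series):
    """Replace each None with the mean of the nearest observed neighbours."""
    filled = list(series)
    for i, value in enumerate(series):
        if value is not None:
            continue
        left = next(
            (series[k] for k in range(i - 1, -1, -1) if series[k] is not None),
            None,
        )
        right = next(
            (series[k] for k in range(i + 1, len(series)) if series[k] is not None),
            None,
        )
        neighbours = [v for v in (left, right) if v is not None]
        if not neighbours:
            raise ValueError("series contains no observed values")
        filled[i] = sum(neighbours) / len(neighbours)
    return filled


# Hypothetical raw data: a per-minute target index and irregular SQL samples.
target_raw = [(0, 10.0), (60, 12.0), (180, 20.0)]
sql_raw = [(5, 1.0), (30, 3.0), (70, 4.0), (200, 8.0)]

first_series = fill_with_adjacent_mean(aggregate_by_minute(target_raw, 0, 240))
second_series = fill_with_adjacent_mean(aggregate_by_minute(sql_raw, 0, 240))
```

Both series end up on the same 1-minute grid with no gaps, which is what the subsequent regression step requires.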

Step 104: train a linear regression model based on the first time sequence data and the second time sequence data to obtain a model regression coefficient.

In the embodiment of the present specification, training the linear regression model may specifically be a process of fitting the target index from the candidate indexes. More specifically, the training may include: taking the target index as the value to be predicted, and predicting the target index value at time t from the candidate index value at time t-1, the target index value at time t-1, and the candidate index value at time t, where time t is any time in the abnormal time period.
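One possible way to arrange the inputs for this prediction is sketched below. The text does not specify how the lagged terms are laid out, so the column order (candidates at t-1, target at t-1, candidates at t) and the function name build_lagged_design are assumptions for illustration only.

```python
def build_lagged_design(y, X):
    """Build regression rows that predict y[t] from X[t-1], y[t-1] and X[t].

    y: list of n target index values.
    X: list of n rows, each holding the p candidate index values at one time.
    Returns (features, targets) with n - 1 rows; the first time step is
    dropped because it has no predecessor.
    """
    if len(y) != len(X):
        raise ValueError("y and X must cover the same time steps")
    features, targets = [], []
    for t in range(1, len(y)):
        # Assumed column order: candidates at t-1, target at t-1, candidates at t.
        features.append(list(X[t - 1]) + [y[t - 1]] + list(X[t]))
        targets.append(y[t])
    return features, targets


# Hypothetical data: three time steps, two candidate indexes.
y_series = [1.0, 2.0, 3.0]
x_series = [[10.0, 20.0], [11.0, 21.0], [12.0, 22.0]]
features, targets = build_lagged_design(y_series, x_series)
```

Each feature row then has 2p + 1 entries, and any linear regression routine can be fitted to (features, targets).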

The linear regression model may be represented by the following formula (1).

$$y = X\beta + \epsilon \tag{1}$$

Wherein y represents a target index and is an n-dimensional vector. Specifically, y represents a time series of a single target index in the vicinity of the abnormality point, the length of the time series being n.

X represents a candidate index and is an n × p matrix. Specifically, X represents a time series of p candidate indexes corresponding to the single target index, and the length of the time series is n.

β represents the coefficient of the linear regression model, and is a p-dimensional vector.

ε represents white noise.

Equivalently, the formula of the bayesian probability form of the linear regression model of the aforementioned formula (1) may be as follows (2).

$$p(y \mid X, \beta, \epsilon) = \mathcal{N}(X\beta, \epsilon) \tag{2}$$

In the embodiments of the present specification, the linear regression model used is specifically a Bayesian linear regression model. The purpose of training the Bayesian linear regression model is to estimate the model regression coefficient β, where βj in β is the coefficient corresponding to the jth candidate index (j takes a value from 0 to p) and reflects the sensitivity of the target index to the jth candidate index. In practical application, feature selection may be implemented based on the regression coefficients of the trained Bayesian linear regression model, which is hereinafter referred to as a Bayesian feature selection model.

The bayesian feature selection model used in the embodiments of the present specification is specifically described below.

In the embodiments of the present specification, on the one hand, in order to ensure that not too many coefficients affect the output, the coefficients of the model need to be sparse. On the other hand, since different candidate indexes may be similar to one another, the model needs to handle multicollinearity among the features so that similar candidate indexes are not ignored and root causes are not missed.

In view of this, on the one hand, to improve the sparsity of the model, in an alternative embodiment, the prior probability distribution of the model regression coefficient β may be made to conform to a horseshoe prior distribution. The horseshoe prior is a prior distribution proposed by Carvalho et al. in 2009 for handling sparsity.

On the other hand, to solve the multicollinearity problem, in an alternative embodiment, the prior probability distribution of the model regression coefficient β may be a multivariate normal distribution whose variance is positively correlated with the covariance of the candidate index feature data (i.e., the second time series data); specifically, the prior probability distribution of the model regression coefficient β may be made to conform to the G prior distribution. In statistics, the G prior is an objective prior probability distribution for the coefficients of a multivariate regression, originally proposed by Arnold Zellner, and is a key tool for Bayesian and empirical Bayesian variable selection.

In practical application, the prior probability distribution of the model regression coefficient β can be made to conform to both the G prior distribution and the horseshoe prior distribution. Specifically, the prior probability distribution of the model regression coefficients may conform to the following equation (3).

Wherein g represents a constant. σ is a p-dimensional vector, where p represents the number of candidate indexes corresponding to a single target index. Each σj follows a standard half-Cauchy distribution over the positive real numbers, where j takes a value from 0 to p. X is an n × p matrix representing the time series of the p candidate indexes corresponding to a single target index, where n represents the length of the time series.

In the prior probability distribution fused with the horseshoe prior represented by equation (3), g serves as the global shrinkage parameter and σj as the local shrinkage parameter. On one hand, as the global shrinkage parameter g becomes smaller, all coefficients βj in the model regression coefficient β tend to 0; on the other hand, the local shrinkage parameter σj allows some βj to escape the shrinkage imposed by g. The resulting β is therefore sparse. Compared with other priors used to improve model sparsity (e.g., the Student's t distribution, the Laplace distribution, etc.), this prior places more density at 0 and at very large βj; in other words, it distinguishes the zero and non-zero elements of β better. In practical applications it is also more robust than those priors when the sparsity is unknown and when there are many non-zero elements.

Compared with other prior distributions (for example, the Laplace distribution and Gaussian distributions with diagonal covariance), the prior probability distribution fused with the G prior introduces the covariance X^T X of the features X, so as to characterize the interdependence among different features, that is, among different candidate indexes. In the posterior distribution of β obtained in this way, the correlation among different elements of β is larger, so the multicollinearity problem can be handled.

As described above, in the embodiments of the present specification, feature selection is performed with the Bayesian feature selection model, which has advantages over other linear feature selection models. For example, the Lasso model cannot adapt to collinear input features: when strongly collinear features are present, it arbitrarily selects one of them and discards the others, which may cause root causes to be missed (false negatives). The ElasticNet model adds some ability to handle collinear features on top of Lasso, but depends on hyperparameter selection. The ARD model has some feature selection capability and also supports some selection among collinear features, but its Gaussian prior is too idealized and does not adapt well to various types of input features.
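The interplay of the global and local parameters can be illustrated by sampling from a plain horseshoe prior. The sketch below is an assumption-laden simplification: it draws each coefficient independently as a zero-mean Gaussian with standard deviation g·σj, with σj standard half-Cauchy, and deliberately omits the feature-covariance term of the G prior, so it is not equation (3). The function names and the seed are hypothetical.

```python
import math
import random


def sample_half_cauchy(rng):
    """Draw from a standard half-Cauchy via its inverse CDF, tan(pi * u / 2)."""
    return math.tan(math.pi * rng.random() / 2.0)


def sample_horseshoe(p, g, seed=0):
    """Draw p independent coefficients from a plain horseshoe prior.

    g is the global scale; each coefficient has its own local scale sigma_j.
    Simplification: no covariance between coefficients is modelled here.
    """
    rng = random.Random(seed)
    betas = []
    for _ in range(p):
        sigma_j = sample_half_cauchy(rng)  # local scale, heavy-tailed
        betas.append(rng.gauss(0.0, g * sigma_j))
    return betas


# Same seed, different global scale: every draw shrinks as g shrinks.
small_g = sample_horseshoe(50, g=0.1)
large_g = sample_horseshoe(50, g=1.0)
```

Because the half-Cauchy has heavy tails, a few σj are very large even when g is small, which is how individual coefficients can remain far from zero while most are pulled toward it.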

Step 106: according to the model regression coefficient, determine the candidate indexes in the first candidate index set whose model regression coefficients meet the first preset condition as belonging to a second candidate index set.

In practical applications, the model regression coefficient βj is obtained as a distribution rather than a point value. Therefore, in the embodiment of the present specification, an estimated value of βj may be used when βj participates in a calculation. The estimated value may be a mean, median, quantile, etc. Since q(βj) follows a Gaussian distribution, the integral average value <βj> of the model regression coefficient βj is preferably used as its estimated value in calculations such as those in steps 106, 108, and 110.

In an embodiment of the present specification, step 106 may specifically include determining, from the first candidate index set, a candidate index corresponding to a model regression coefficient greater than a first preset threshold as belonging to a second candidate index set. More specifically, from the first candidate index set, a candidate index corresponding to a model regression coefficient β j whose integrated average value < β j > is greater than a first preset threshold may be determined as belonging to a second candidate index set. For convenience of description, hereinafter, the model regression coefficient β j may be replaced with < β j >.

In the sparse linear model, the first preset threshold may be 0 optionally. In this case, when β j is 0, then the corresponding jth candidate index may be ignored; conversely, the corresponding candidate metric may be selected to belong to a second set of candidate metrics.
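The screening in step 106 can be sketched as a simple filter over the estimated coefficients. The function name and the example metric values are hypothetical. The comparison follows the "greater than a first preset threshold" wording literally; the optional by_magnitude flag, which compares absolute values instead, is an assumption added for cases where negative coefficients should also be retained.

```python
def select_second_candidate_set(names, beta_means, threshold=0.0, by_magnitude=False):
    """Keep candidates whose estimated coefficient exceeds the threshold.

    names: candidate index names, aligned with beta_means.
    beta_means: estimated coefficient <beta_j> for each candidate.
    by_magnitude: if True, compare |<beta_j>| instead (an assumption, not
    something the text specifies).
    """
    selected = []
    for name, beta in zip(names, beta_means):
        value = abs(beta) if by_magnitude else beta
        if value > threshold:
            selected.append(name)
    return selected


# Hypothetical coefficients for three SQL metrics.
candidate_names = ["cpu_time", "queue_time", "rpc_count"]
candidate_betas = [0.8, 0.0, 0.05]
second_set = select_second_candidate_set(candidate_names, candidate_betas)
```

With a threshold of 0, the candidate whose coefficient is exactly 0 is dropped and the remaining ones form the second candidate index set.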

Step 108: calculate an attribution score for each candidate index in the second candidate index set; the attribution score reflects the proportion of the total variation of the target index during the abnormal period that is contributed by the variation of the candidate index.

For a linear regression model, the model regression coefficient < β j > is equal to the gradient of y with respect to xj, and therefore, the model regression coefficient < β j > describes the sensitivity of y to xj. In other words, the value of < β j > quantifies the effect of small changes in xj on y. Therefore, in the embodiments of the present specification, the attribution score of a candidate index may be set to positively correlate with the model regression coefficient corresponding to the candidate index.

However, high sensitivity does not mean that the contribution to the abnormality of the target index is large. For example, if the model regression coefficient < β j > is non-zero and relatively large, but the candidate index xj is very small, then in actual application, in view of the small scale of xj, the contribution to the abnormality of the target index is not large, and xj needs to be ignored. Therefore, in practical applications, the values of the corresponding candidate indices xj may be introduced when calculating the attribution scores rj. This allows xj to be selected as the root cause indicator only if < β j > and xj are both large.

On this basis, in the embodiment of the present specification, considering that the goal of the scheme is to attribute an abnormality in the target index to a root index in the candidate indexes, it is more focused on a change in y due to a change in xj within an abnormality period, rather than just the value of xj. In other words, what should be more focused on is the marginal effect of the candidate index, i.e., how the target index will change if the abnormal portion in the candidate index is replaced with the normal portion. Therefore, in the embodiments of the present specification, the attribution score of a candidate index may be further set to positively correlate with the variation of the candidate index within the abnormal period. The variation may be a difference between a value of the candidate index at the abnormal time and a background value of the candidate index. The background value may be an average value, a median, or a quantile of each value in the normal time period, or may be any value selected from the normal time periods.

In an alternative embodiment, the value of the attribution score may be calculated according to the following equation (4).

$$r_j = \left|\langle\beta_j\rangle\,\Delta x_j\right| \tag{4}$$

In actual application, the attribution score may be normalized to obtain a calculation formula of the attribution score as in the following formula (5).

Here, rj represents the attribution score of the candidate index j, j takes a value from 0 to p, and p represents the number of candidate indexes corresponding to a single target index. The larger the attribution score rj of the candidate index j, the more likely the candidate index j is a root cause.

| | represents an absolute value.

<βj> represents the model regression coefficient corresponding to the candidate index j.

Δ xj represents the difference between the value of the candidate index j at the abnormal time and the background value of the candidate index j.

Δ y represents the difference between the value of the target index at the abnormal time and the background value of the target index.

Wherein the background value may be a normal value corresponding to a normal time period. Specifically, the background value may be an average value, a median, or a quantile of each value in the normal period, or may be any value selected from the normal period.
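The attribution score calculation described above can be illustrated with a minimal Python sketch. The function names `background_value` and `attribution_scores` are hypothetical and do not come from the specification. The raw score follows formula (4). Because formula (5) is not reproduced in this text, the normalization shown here (dividing by |Δy|) is only an assumption, based on the statement that the attribution score reflects the proportion of the target index variation contributed by the candidate index.

```python
from statistics import median


def background_value(normal_values, how="mean"):
    """Background value from normal-period samples (mean or median)."""
    if how == "median":
        return median(normal_values)
    return sum(normal_values) / len(normal_values)


def attribution_scores(beta_mean, x_abnormal, x_background,
                       y_abnormal=None, y_background=None):
    """Raw scores per formula (4): r_j = |<beta_j> * delta_x_j|.

    If the target values are supplied, each raw score is divided by
    |delta_y|. This normalization is an ASSUMPTION, since formula (5)
    is not available in the text.
    """
    raw = [abs(b * (xa - xb))
           for b, xa, xb in zip(beta_mean, x_abnormal, x_background)]
    if y_abnormal is None or y_background is None:
        return raw
    delta_y = abs(y_abnormal - y_background)
    if delta_y == 0:
        # No change in the target index: fall back to the raw scores.
        return raw
    return [r / delta_y for r in raw]


# Illustrative values only (not taken from the specification).
x_bg = [background_value([1.0, 1.0, 1.0]), background_value([1.0, 2.0, 3.0])]
raw_scores = attribution_scores([2.0, 0.5], [5.0, 12.0], x_bg)
normalized = attribution_scores([2.0, 0.5], [5.0, 12.0], x_bg,
                                y_abnormal=22.0, y_background=2.0)
```

Using the mean or median as the background value is one of the options the text lists; a quantile or an arbitrary normal-period sample could be substituted in `background_value`.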

Step 110: according to the attribution scores, determine, from the second candidate index set, the candidate indexes whose attribution scores meet a second preset condition as belonging to the root cause index set.

In an embodiment of the present specification, step 110 may specifically include sorting the candidate indexes in the second candidate index set in descending order of attribution score to obtain a candidate index sequence, and then determining the first several candidate indexes in the candidate index sequence as root cause indexes.

The second preset condition may include that the candidate index is among the first k candidate indexes when the candidate indexes are sorted in descending order of attribution score. Here k represents the preset number of final root cause indexes and may be set, for example, to any value from 1 to 5.

In actual application, for each target index yk, an attribution score rjk for each candidate root cause xj may be calculated; then, for each target index yk, the attribution scores rjk may be arranged in descending order; then, for each target index yk, the k root causes with the largest attribution score may be retained.
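The ranking in step 110 can be sketched as follows. This is a minimal illustration; the function name `top_k_root_causes` and the metric names in the usage example are hypothetical.

```python
def top_k_root_causes(scores, k):
    """Return the k candidate indexes with the largest attribution scores.

    scores: dict mapping candidate index name -> attribution score.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


def top_k_per_target(scores_by_target, k):
    """Apply the top-k selection separately for each abnormal target index."""
    return {target: top_k_root_causes(scores, k)
            for target, scores in scores_by_target.items()}


# Illustrative values only.
scores_by_target = {
    "target_a": {"cand_1": 0.10, "cand_2": 0.55, "cand_3": 0.30},
    "target_b": {"cand_1": 0.70, "cand_3": 0.05},
}
root_causes = top_k_per_target(scores_by_target, k=2)
```

If fewer than k candidates are available for a target index, the slice simply returns all of them.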

According to the scheme of the embodiment of the present specification, the attribution analysis in steps 108 and 110 measures well how much the change in each candidate index contributes to the change in the target index, and thereby determines which candidate indexes are most likely to be the root cause. The root cause indexes obtained in this way are more accurate and more interpretable.

In the method of fig. 1, candidate indexes that may cause the target index abnormality are first screened out by a linear regression model; the attribution scores of the screened candidate indexes are then calculated based on the influence of the change in each candidate index on the change in the target index, and the root cause indexes are determined according to the attribution scores.

It should be understood that in the method described in one or more embodiments of the present disclosure, the order of some steps may be adjusted according to actual needs, or some steps may be omitted.

In practical applications of the embodiments of the present specification, if β is sparse, for example if the number of non-zero elements in β is less than a preset k, step 108 and step 110 may be omitted.

Therefore, optionally, before calculating the attribution score of each candidate index in the second candidate index set, the method may further include: judging whether the number of the candidate indexes in the second candidate index set is larger than a preset k value or not; if so, calculating attribution scores of the candidate indexes in the second candidate index set, and determining root indexes from the second candidate index set according to the attribution scores; if not, directly determining each candidate index in the second candidate index set as a root factor index.
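The optional check described above can be sketched as follows. The function name `select_root_causes` and the `score_fn` callback are hypothetical; the callback stands in for whatever attribution score calculation is used in step 108.

```python
def select_root_causes(second_set, k, score_fn):
    """Skip attribution analysis when the screened set is already small.

    second_set: list of candidate index names that passed the screening.
    k: preset number of final root cause indexes.
    score_fn: callable returning the attribution score of a candidate
              (a placeholder for step 108).
    """
    if len(second_set) <= k:
        # Not more than k candidates: all of them are root cause indexes.
        return list(second_set)
    scores = {name: score_fn(name) for name in second_set}
    ranked = sorted(second_set, key=lambda name: scores[name], reverse=True)
    return ranked[:k]


# Illustrative values only.
demo_scores = {"a": 0.1, "b": 0.9, "c": 0.5, "d": 0.3}
small = select_root_causes(["a", "b"], 3, demo_scores.get)
large = select_root_causes(["a", "b", "c", "d"], 2, demo_scores.get)
```

In the first call the score function is never invoked, which is the saving the optional check is meant to provide.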

Based on the method of fig. 1, the embodiments of the present specification also provide some specific implementations of the method, which are described below.

In practical application, when more than one abnormal target index exists, after the root cause index is determined for each abnormal target index, the obtained root cause indexes corresponding to each target index can be further aggregated, and the aggregation result is transmitted to the subsequent self-healing decision module, so that the self-healing decision module can process the abnormality more efficiently.

Optionally, after determining, according to the attribution score, a candidate indicator corresponding to an attribution score meeting a second preset condition from the second candidate indicator set as belonging to a root indicator set, the method may further include: taking a union set of root cause index sets corresponding to two or more abnormal target indexes to obtain a first aggregation root cause index set; the first aggregate root index set is used for representing root indexes which cause abnormity of at least one target index in the two or more abnormal target indexes.

Optionally, after determining, from the second candidate index set according to the attribution score, a candidate index corresponding to an attribution score meeting a second preset condition as belonging to a root cause index set, the method may further include: taking intersection of root factor index sets corresponding to two or more abnormal target indexes to obtain a second aggregation root factor index set; the second aggregate root index set is used for representing root indexes which cause all the target indexes of the two or more abnormal target indexes to be abnormal.
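The two aggregation options can be sketched with Python sets. The function name `aggregate_root_causes` and the metric names are hypothetical.

```python
def aggregate_root_causes(root_sets_by_target):
    """Aggregate root cause index sets over several abnormal target indexes.

    Returns (union, intersection):
      union        -> indexes that are a root cause for at least one target
      intersection -> indexes that are a root cause for every target
    """
    sets = [set(s) for s in root_sets_by_target.values()]
    if not sets:
        return set(), set()
    return set.union(*sets), set.intersection(*sets)


# Illustrative values only.
first_aggregate, second_aggregate = aggregate_root_causes({
    "target_a": ["cand_1", "cand_2"],
    "target_b": ["cand_2", "cand_3"],
})
```

The union gives broader coverage for the downstream self-healing decision, while the intersection keeps only the causes shared by all abnormal target indexes.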

According to the above description, an exemplary view of an actual application scenario of the root cause location method provided in the embodiments of the present specification is shown in fig. 2.

In fig. 2, step 201: in the system operation or service execution process, system operation or service execution data can be collected according to a preset time interval, and the system operation or service execution data is aggregated to obtain time sequence data which is stored in a time sequence database.

Step 202: anomaly detection can be performed in real time, through an online detection platform, on the time series data corresponding to the target index; if the target index is detected to be abnormal, execution of the root cause positioning scheme of the embodiments of the present specification is triggered. For example, in a scenario of performing root cause positioning on a problematic SQL program, if target indexes such as SQL_SELECT_RT and LOGIC_READS are abnormal, root cause positioning may be triggered.

Step 203: load the time series data corresponding to the abnormal target index and the time series data corresponding to the candidate indexes associated with the abnormal target index.

Step 204: fit the linear regression model based on the time series data loaded in step 203 to obtain the model regression coefficients of the linear regression model, and screen the candidate indexes according to the model regression coefficients.

Step 205: further perform attribution analysis on the candidate indexes screened in step 204 to obtain the attribution scores of the candidate indexes, so as to determine the root cause indexes corresponding to the abnormal target index according to the attribution scores.

Step 206: when more than one abnormal target index exists, aggregate the root cause indexes corresponding to the abnormal target indexes, for example by taking a union or an intersection, to obtain a root cause positioning result to be output to the self-healing decision module.

Step 207: perform a self-healing decision according to the root cause positioning result output in step 206, and then take self-healing measures according to the decision.
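The data aggregation in step 201 can be illustrated with a minimal sketch that buckets raw samples by a preset time interval. The function name `aggregate_to_series` is hypothetical, and averaging within each bucket is an assumption; the specification does not state which aggregation function is used.

```python
from collections import defaultdict


def aggregate_to_series(samples, interval_seconds):
    """Aggregate raw (timestamp, value) samples into fixed-interval buckets.

    samples: iterable of (timestamp_in_seconds, value) pairs.
    Returns a list of (bucket_start, mean_value) sorted by time.
    Taking the mean per bucket is an ASSUMPTION for illustration.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        bucket_start = int(ts // interval_seconds) * interval_seconds
        buckets[bucket_start].append(value)
    return [(start, sum(vals) / len(vals))
            for start, vals in sorted(buckets.items())]


# Illustrative values only: one-minute granularity.
raw = [(0, 1.0), (30, 3.0), (60, 5.0), (119, 7.0), (120, 2.0)]
series = aggregate_to_series(raw, 60)
```

Applying the same interval to the target index and to every candidate index yields time series of equal granularity, which is what the regression in step 204 requires.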

Based on the root cause positioning method in the embodiments of the present specification, when a target index is abnormal, the root cause index can be located in the corresponding candidate index set; that is, it can be determined which candidate index or indexes are the root cause of the target index abnormality.

The root cause positioning method provided by the embodiment of the specification has the following characteristics:

1) Because a Bayesian feature selection model is used for feature selection, the method adapts well to situations in which the number of candidate indexes is dynamic, unknown, and possibly large, and in which the number of samples (i.e., the length of the time series data) may be smaller than the number of candidate root causes;

2) Because a prior probability distribution that fuses the g-prior and the horseshoe prior is used when training the Bayesian feature selection model, the root cause indexes among the candidate indexes can be located better; for example, when root cause indexes are collinear they can be found simultaneously, and when no suitable root cause exists the candidate indexes are ignored;

3) Because the root cause is determined by attribution analysis, the method is more accurate and its results are more interpretable, which facilitates subsequent self-healing decisions;

4) Compared with root cause positioning schemes such as ARD, LASSO, Elastic Net, and fsMTS, the method takes less time and achieves higher accuracy.

Based on the same idea, the embodiment of the present specification further provides a device corresponding to the above method. Fig. 3 is a schematic structural diagram of a root cause positioning device corresponding to fig. 1 provided in an embodiment of the present disclosure. As shown in fig. 3, the apparatus may include:

a data obtaining module 302, configured to obtain first time series data corresponding to a target indicator and second time series data corresponding to each candidate indicator in a first candidate indicator set associated with the target indicator in an abnormal time period; the target index is used for reflecting occurrence of system abnormity or business abnormity; the candidate index is used for reflecting the reason of the occurrence of the abnormality;

a model coefficient determining module 304, configured to train a linear regression model based on the first time series data and the second time series data to obtain a model regression coefficient;

a first determining module 306, configured to determine, according to the model regression coefficient, a candidate index corresponding to a model regression coefficient that meets a first preset condition from the first candidate index set as belonging to a second candidate index set;

an attribution score calculating module 308 for calculating an attribution score for each candidate indicator in the second set of candidate indicators; the attribution score is used for reflecting the proportion of the variation of the target index contributed by the variation of the candidate index in the total variation of the target index in an abnormal period;

a second determining module 310, configured to determine, according to the attribution score, a candidate indicator corresponding to an attribution score meeting a second preset condition from the second candidate indicator set as belonging to the root indicator set.

It will be appreciated that the modules described above refer to computer programs or program segments for performing a certain function or functions. In addition, the distinction between the above-described modules does not mean that the actual program code must also be separated.

Based on the same idea, the embodiment of the present specification further provides a device corresponding to the above method.

Fig. 4 is a schematic structural diagram of a root cause positioning device corresponding to fig. 1 provided in an embodiment of the present disclosure. As shown in fig. 4, the device 400 may include:

at least one processor 410; and

a memory 430 communicatively coupled to the at least one processor; wherein,

the memory 430 stores instructions 420 executable by the at least one processor 410 to cause the at least one processor 410 to:

and according to the attribution scores, determining candidate indexes corresponding to the attribution scores meeting a second preset condition from the second candidate index set as the candidate indexes belonging to the root factor index set.

Based on the same idea, the embodiment of the present specification further provides a computer-readable medium corresponding to the above method. The computer readable medium has computer readable instructions stored thereon that are executable by a processor to implement the method of:

calculating an attribution score of each candidate index in the second candidate index set; the attribution score is used for reflecting the proportion of the variation of the target index contributed by the variation of the candidate index in the total variation of the target index in an abnormal period;

While certain embodiments of the specification have been described above, in some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.

The embodiments in this specification are described in a progressive manner, and the same and similar parts among the embodiments can be referred to each other.

The apparatus, the device, and the method provided in the embodiments of the present specification are corresponding, and therefore, the apparatus and the device also have beneficial technical effects similar to those of the corresponding method, and since the beneficial technical effects of the method have been described in detail above, the beneficial technical effects of the corresponding apparatus and device are not described again here.

In the 1990s, an improvement to a technology could be clearly distinguished as either a hardware improvement (e.g., an improvement to a circuit structure such as a diode, transistor, or switch) or a software improvement (an improvement to a method flow). However, as technology has advanced, many of today's method flow improvements can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement to a method flow cannot be implemented with hardware entity modules. For example, a Programmable Logic Device (PLD) (e.g., a Field Programmable Gate Array (FPGA)) is an integrated circuit whose logic functions are determined by the user programming the device. Designers program a digital system onto a single PLD themselves, without requiring a chip manufacturer to design and fabricate a dedicated integrated circuit chip. Moreover, instead of manually fabricating integrated circuit chips, such programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the source code to be compiled must also be written in a specific programming language, called a Hardware Description Language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are the most commonly used at present. It will also be apparent to those skilled in the art that a hardware circuit implementing a logical method flow can be readily obtained simply by logically programming the method flow in one of the above hardware description languages and programming it into an integrated circuit.

The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of the memory. Those skilled in the art will also appreciate that, in addition to implementing the controller as pure computer-readable program code, the same functionality can be achieved by logically programming the method steps so that the controller takes the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may thus be considered a hardware component, and the means included therein for performing the various functions may also be considered structures within the hardware component. The means for performing the functions may even be regarded as both software modules for performing the method and structures within the hardware component.

The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.

For convenience of description, the above devices are described as being divided into various units by function, respectively. Of course, the functionality of the units may be implemented in one or more software and/or hardware when implementing the present application.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.

The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art to which the present application pertains. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims

1. A method of root cause location, comprising:

acquiring first time sequence data corresponding to a target index and second time sequence data corresponding to each candidate index in a first candidate index set associated with the target index in an abnormal time period; the target index is used for reflecting system abnormity or business abnormity; the candidate index is used for reflecting the reason of the occurrence of the abnormality;

calculating an attribution score of each candidate index in the second candidate index set; the attribution score is used for reflecting the proportion of the change of the target index contributed by the change of the candidate index in the total change of the target index in an abnormal period;

2. The method according to claim 1, comprising in particular:

acquiring target index data corresponding to a target index and candidate index data corresponding to each candidate index in a first candidate index set associated with the target index in an abnormal time period;

and respectively aggregating the target index data and the candidate index data according to the same preset time granularity to obtain first time sequence data corresponding to the target index data and second time sequence data corresponding to the candidate index data.

3. The method of claim 1, wherein the linear regression model specifically comprises a bayesian feature selection model.

4. The method of claim 3, wherein the prior probability distribution of the model regression coefficients is a multivariate normal distribution; wherein a variance of the multivariate normal distribution is positively correlated with a covariance of the second time series data.

5. The method of claim 4, wherein the prior probability distribution of the model regression coefficients conforms to a G prior distribution.

6. The method of claim 5, wherein the prior probability distribution of the model regression coefficients conforms to a horseshoe-shaped prior distribution.

7. The method of claim 6, wherein the prior probability distribution of the model regression coefficients conforms to the following equation:

wherein g represents a constant;

σ is a p-dimensional vector, wherein p represents the number of candidate indexes corresponding to a single target index;

σ_j follows a standard half-Cauchy distribution on the positive real numbers, wherein j takes a value from 0 to p;

and X is an n × p matrix representing the time series of the p candidate indexes corresponding to a single target index, wherein n represents the length of the time series.

8. The method according to claim 1, wherein the determining, from the first candidate index set according to the model regression coefficient, a candidate index corresponding to a model regression coefficient that meets a first preset condition as belonging to a second candidate index set specifically includes:

and determining the candidate indexes corresponding to the model regression coefficients larger than a first preset threshold value from the first candidate index set as belonging to a second candidate index set.

9. The method of claim 1, wherein the attribution score of a candidate indicator is positively correlated to the model regression coefficient corresponding to the candidate indicator.

10. The method of claim 9, wherein the attribution score of a candidate indicator is positively correlated to a change in the candidate indicator over an abnormal period of time.

11. The method of claim 10, wherein the attribution score is calculated according to:

wherein r_j represents the attribution score of the candidate index j, j takes a value from 0 to p, and p represents the number of candidate indexes corresponding to a single target index;

| | represents an absolute value;

<β_j> represents the model regression coefficient corresponding to the candidate index j;

Δx_j represents the difference between the value of the candidate index j at the abnormal time and the background value of the candidate index j;

12. The method according to claim 1, wherein the determining, from the second candidate index set according to the attribution score, a candidate index corresponding to an attribution score meeting a second preset condition as belonging to a root index set specifically includes:

sorting all candidate indexes in the second candidate index set in a descending order according to the attribution scores to obtain a candidate index sequence;

determining the first several candidate indexes in the candidate index sequence as root factor indexes.

13. The method of claim 1, wherein after determining, from the second set of candidate indicators according to the attribution score, a candidate indicator corresponding to an attribution score meeting a second preset condition as belonging to a set of root indicators, the method further comprises:

taking a union set of root cause index sets corresponding to two or more abnormal target indexes to obtain a first aggregation root cause index set; the first aggregate root index set is used for representing root index which causes abnormality of at least one target index in the two or more abnormal target indexes.

14. The method of claim 1, wherein after determining, from the second set of candidate indicators according to the attribution score, a candidate indicator corresponding to an attribution score meeting a second preset condition as belonging to a set of root indicators, the method further comprises:

taking intersection of root factor index sets corresponding to two or more abnormal target indexes to obtain a second aggregation root factor index set; the second aggregate root index set is used for representing root indexes which cause all the target indexes of the two or more abnormal target indexes to be abnormal.

15. The method of claim 1 for locating problematic SQL programs, wherein,

the target indexes specifically comprise tenant key indexes; the candidate metrics specifically include SQL metrics.

16. The method of claim 1 for container fault root cause localization, wherein,

the target index specifically comprises the number of failed calls occurring in the container; the candidate metrics specifically include container metrics.

17. A root cause location device comprising:

18. A root cause location device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

19. A computer readable medium having computer readable instructions stored thereon which are executable by a processor to implement the root cause localization method of any one of claims 1 to 16.