CN113657715A - Root cause positioning method and system based on kernel density estimation calling chain

Info

Publication number: CN113657715A
Application number: CN202110799721.3A
Authority: CN
Inventors: 李立泓; 闫二乐; 郑康秋; 林诚汉; 陈立峰; 林俊德
Original assignee: Fujian Newland Software Engineering Co Ltd
Current assignee: Fujian Newland Software Engineering Co Ltd
Priority date: 2021-07-15
Filing date: 2021-07-15
Publication date: 2021-11-16

Abstract

The invention provides a root cause positioning method and system based on a kernel density estimation call chain, in the technical field of computers. The method comprises: step S10, collecting the availability index and KPI indexes of each business service, together with the response time and success rate indexes between business services on each call chain; step S20, monitoring the availability index against a set threshold; step S30, based on kernel density estimation, converting the KPI indexes of the call chain into node abnormal scores and converting the response time and success rate indexes of the call chain into node edge abnormal scores; step S40, loading the node abnormal scores and node edge abnormal scores onto a static topological graph of the IT system to obtain a fault propagation graph; and step S50, performing a random walk on the fault propagation graph using a random walk algorithm to locate the faulty nodes and KPI indexes. The advantages of the invention are that root cause positioning efficiency is greatly improved and the operation and maintenance cost of the IT system is greatly reduced.

Description

Root cause positioning method and system based on kernel density estimation calling chain

Technical Field

The invention relates to the technical field of computers, in particular to a root cause positioning method and system based on a kernel density estimation calling chain.

Background

With the development of information technology and the migration of numerous systems to the cloud, IT architecture has moved beyond a simple front-end/back-end separation and evolved into complex architectures such as distributed systems, microservices and domain-driven design (DDD). Today, a large-scale IT system often contains thousands of applications and is highly dynamic and complex. A single business service in such a system comprises anywhere from several to thousands of instances, each running in a different container or on a different server, and the availability of these instances has become a key challenge for large-scale IT systems.

Under distributed, microservice and DDD architectures, a complete service request involves a plurality of service units, and the business systems and service units call one another to form a call chain. Any exception on the call chain may propagate along it and ultimately cause the service request to fail, a problem commonly encountered by large-scale IT systems. Because a failed service request directly affects the user experience and the revenue of the enterprise, the operation and maintenance engineer needs to monitor the service-level KPIs (e.g., response time) and the host-level KPIs (e.g., CPU usage) on each host involved in the service request. When a service request fails, the operation and maintenance engineer must locate the faulty machine (the root cause) as soon as possible and resolve the fault quickly.

For root cause positioning, the prior art relies on operation and maintenance engineers checking faults manually. However, because an IT system has complex call relationships among services and a large number of indexes, it is difficult for an engineer to quickly pinpoint the problem among the many services and indexes, so root cause positioning is inefficient.

Therefore, how to provide a root cause positioning method and system based on a kernel density estimation call chain, so as to improve root cause positioning efficiency and reduce the operation and maintenance cost of an IT system, has become a problem to be solved urgently.

Disclosure of Invention

The technical problem to be solved by the present invention is to provide a root cause positioning method and system based on a kernel density estimation call chain, so as to improve the root cause positioning efficiency and reduce the operation and maintenance cost of an IT system.

In a first aspect, the present invention provides a root cause positioning method based on a kernel density estimation call chain, comprising the following steps:

step S10, acquiring the availability index and KPI indexes of each business service in the IT system, and the response time and success rate indexes between business services on each call chain;

step S20, setting a threshold, monitoring the availability index against the threshold, and judging whether the IT system has a fault;

step S30, based on kernel density estimation, converting the KPI indexes of the call chain associated with the business service into node abnormal scores, and converting the response time and success rate indexes of that call chain into node edge abnormal scores;

step S40, loading the node abnormal scores and the node edge abnormal scores onto a static topological graph of the IT system to obtain a fault propagation graph;

and step S50, performing a random walk on the fault propagation graph using a random walk algorithm, locating the faulty nodes and KPI indexes, and completing root cause positioning.
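As an illustration of step S40, the sketch below shows one way the scores could be attached to a static topology. The class name `FaultPropagationGraph`, the function `build_fault_propagation_graph`, the service names and the default score of 0.0 for unscored nodes and edges are all assumptions made for this example; the source does not specify a data structure.

```python
from dataclasses import dataclass, field


@dataclass
class FaultPropagationGraph:
    """Static topology annotated with abnormal scores (illustrative structure)."""
    node_scores: dict = field(default_factory=dict)  # service -> node abnormal score
    edge_scores: dict = field(default_factory=dict)  # (caller, callee) -> node edge abnormal score

    def neighbors(self, node):
        """Services directly called by `node`."""
        return [callee for (caller, callee) in self.edge_scores if caller == node]


def build_fault_propagation_graph(topology_edges, node_scores, edge_scores):
    """Load the step S30 scores onto the static topology (step S40).

    Nodes or edges without a score default to 0.0, meaning "no anomaly
    observed"; that default is an assumption of this sketch.
    """
    graph = FaultPropagationGraph()
    for caller, callee in topology_edges:
        graph.node_scores.setdefault(caller, node_scores.get(caller, 0.0))
        graph.node_scores.setdefault(callee, node_scores.get(callee, 0.0))
        graph.edge_scores[(caller, callee)] = edge_scores.get((caller, callee), 0.0)
    return graph


# Hypothetical topology and scores.
topology = [("gateway", "order"), ("order", "payment"), ("order", "inventory")]
graph = build_fault_propagation_graph(
    topology,
    node_scores={"payment": 7.2, "order": 1.1},
    edge_scores={("order", "payment"): 5.4},
)
```

Keeping the topology and the scores as separate inputs mirrors the text: the static graph can be refreshed on its own schedule, while the scores are recomputed for each fault window.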

Further, in step S10, the KPI indexes include at least CPU utilization and memory utilization.

Further, the step S20 is specifically:

setting a threshold and judging, for each business service in turn, whether its availability index is greater than the threshold; if so, a fault exists and the method proceeds to step S30; if not, no fault exists and monitoring continues.
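The threshold check of step S20 can be sketched as follows. The function name `find_faulty_services` and the sample readings are hypothetical. The comparison direction (index greater than threshold means a fault) follows the text, which does not define the units or exact semantics of the availability index.

```python
def find_faulty_services(availability, threshold):
    """Return the services whose availability index exceeds the threshold.

    The direction of the comparison is taken from the text; what the index
    measures is not specified there, so the values below are placeholders.
    """
    return [name for name, value in availability.items() if value > threshold]


# Hypothetical readings of the availability index per business service.
readings = {"gateway": 0.01, "order": 0.12, "payment": 0.03}
faulty = find_faulty_services(readings, threshold=0.05)

if faulty:
    print("fault detected, proceed to step S30 for:", faulty)
else:
    print("no fault, continue monitoring")
```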

Further, the step S30 is specifically:

based on kernel density estimation, constructing a KDE model for the historical data of each of the KPI indexes, the response time and the success rate index to obtain the corresponding probability density functions; inputting the KPI indexes, response time and success rate index of the call chain associated with the business service within the fault time window into the corresponding probability density functions to obtain probability densities; and converting the probability densities into node abnormal scores and node edge abnormal scores through a logarithmic function.
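A minimal sketch of this step, using only the Python standard library, is given below. Several choices are assumptions not stated in the source: the Gaussian kernel, the rule-of-thumb bandwidth `1.06 * stdev * n^(-1/5)`, the negative sign on the logarithm (so that rarer values score higher), averaging over the fault window, and the small `eps` that avoids `log(0)`. The function names and sample values are hypothetical.

```python
import math
import statistics


def gaussian_kde_pdf(history, bandwidth=None):
    """Fit a Gaussian-kernel density estimate to `history` and return its pdf."""
    n = len(history)
    if bandwidth is None:
        # Normal-reference rule of thumb; the source names no bandwidth rule.
        bandwidth = 1.06 * statistics.stdev(history) * n ** (-1 / 5)
    norm = n * bandwidth * math.sqrt(2 * math.pi)

    def pdf(x):
        return sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in history) / norm

    return pdf


def abnormal_score(pdf, window, eps=1e-12):
    """Mean negative log density over the fault window; higher means more abnormal."""
    return sum(-math.log(pdf(x) + eps) for x in window) / len(window)


# Hypothetical historical CPU utilization samples (percent).
cpu_history = [41.0, 39.5, 40.2, 42.1, 38.8, 40.7, 41.3, 39.9]
cpu_pdf = gaussian_kde_pdf(cpu_history)

normal_score = abnormal_score(cpu_pdf, [40.5, 41.0])  # values close to the history
fault_score = abnormal_score(cpu_pdf, [93.0, 95.5])   # values far from the history
print(normal_score < fault_score)
```

The same two functions could be applied to response time and success rate samples to produce the node edge abnormal scores.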

Further, the step S50 is specifically:

performing a random walk on the fault propagation graph using a random walk algorithm, at each step randomly visiting a next adjacent node, recording the number of visits to each node and sorting the nodes in descending order of visits, thereby locating the faulty node and its corresponding KPI indexes and completing root cause positioning.
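The visit-counting walk described here can be sketched as below. The function name `rank_by_visits`, the transition weights, the step count and the fixed seed are assumptions for illustration; the source does not say how the weights are derived from the abnormal scores.

```python
import random
from collections import Counter


def rank_by_visits(transitions, start, steps=10_000, seed=0):
    """Random-walk over `transitions` and return (node, visits) in descending order.

    `transitions[node]` maps each reachable next node to a non-negative weight.
    """
    rng = random.Random(seed)
    visits = Counter()
    current = start
    for _ in range(steps):
        candidates = list(transitions[current])
        weights = [transitions[current][c] for c in candidates]
        current = rng.choices(candidates, weights=weights, k=1)[0]
        visits[current] += 1
    return visits.most_common()


# Hypothetical weights: each move favors the neighbor with the higher abnormal score.
transitions = {
    "gateway":   {"order": 1.0},
    "order":     {"payment": 0.84, "inventory": 0.03, "order": 0.13},
    "payment":   {"payment": 0.87, "order": 0.13},
    "inventory": {"order": 0.79, "inventory": 0.21},
}
ranking = rank_by_visits(transitions, start="gateway")
print(ranking[0][0])  # the most visited node is the root cause candidate
```

Because the walker is drawn toward high-score nodes and tends to stay there, the node at the head of the ranking is the one the walk treats as the most likely root cause.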

In a second aspect, the present invention provides a system for root cause location based on kernel density estimation call chains, comprising the following modules:

the data acquisition module is used for acquiring the availability index and KPI indexes of each business service in the IT system, and the response time and success rate indexes between business services on each call chain;

the availability index monitoring module is used for setting a threshold, monitoring the availability index against the threshold, and judging whether the IT system has a fault;

the kernel density estimation module is used for converting, based on kernel density estimation, the KPI indexes of the call chain associated with the business service into node abnormal scores, and converting the response time and success rate indexes of that call chain into node edge abnormal scores;

the fault propagation graph building module is used for loading the node abnormal scores and the node edge abnormal scores onto a static topological graph of the IT system to obtain a fault propagation graph;

and the root cause positioning module is used for performing a random walk on the fault propagation graph using a random walk algorithm, locating the faulty nodes and KPI indexes, and completing root cause positioning.

Further, in the data acquisition module, the KPI indexes include at least CPU utilization and memory utilization.

Further, the availability index monitoring module specifically includes:

setting a threshold and judging, for each business service in turn, whether its availability index is greater than the threshold; if so, a fault exists and control passes to the kernel density estimation module; if not, no fault exists and monitoring continues.

Further, the kernel density estimation module specifically includes:

Further, the root cause positioning module specifically comprises:

The invention has the advantages that:

The availability index, KPI indexes, response time and success rate indexes are collected, and a fault is deemed to exist when the availability index exceeds the set threshold. Based on kernel density estimation, the KPI indexes of the call chain associated with the business service are then converted into node abnormal scores, and the response time and success rate indexes are converted into node edge abnormal scores. These scores are loaded onto the static topological graph of the IT system to obtain a fault propagation graph, and finally a random walk algorithm walks the fault propagation graph, so that the faulty nodes and KPI indexes can be located automatically.

Drawings

The invention is further described below through embodiments with reference to the accompanying drawings.

FIG. 1 is a flow chart of a method for root cause location based on kernel density estimation call chains according to the present invention.

FIG. 2 is a schematic structural diagram of a root cause location system based on kernel density estimation call chain according to the present invention.

Detailed Description

The general idea of the technical scheme in the embodiments of the application is as follows: collect the availability index of each business service of the IT system, the response time and success rate indexes between business services on the call chain, and the KPI indexes (CPU utilization and memory utilization) of each business service; trigger root cause positioning when the availability index of a business service exceeds the set threshold; through kernel density estimation, convert the KPI indexes of the business services on all call chains associated with the faulty business service into node abnormal scores, and convert the response time and success rate indexes between the business services on those call chains into node edge abnormal scores; map the node abnormal scores and node edge abnormal scores onto the static topological graph to form a fault propagation graph; and perform a random walk on the fault propagation graph with a random walk algorithm to automatically locate the faulty node and the failed KPI, thereby improving root cause positioning efficiency and reducing the operation and maintenance cost of the IT system.

Referring to FIG. 1 and FIG. 2, a preferred embodiment of the root cause positioning method based on a kernel density estimation call chain according to the present invention includes the following steps:

step S10, continuously collecting, at 1-minute intervals, the availability index and KPI indexes of each business service in the IT system and the response time and success rate indexes between business services on each call chain, and storing the collected data in Elasticsearch;

step S20, setting a threshold, monitoring the availability index against the threshold, and judging whether the IT system has a fault; root cause positioning is triggered when the availability index exceeds the threshold;

step S30, based on kernel density estimation (KDE), converting the KPI indexes of the call chain associated with the business service into node abnormal scores, and converting the response time and success rate indexes of that call chain into node edge abnormal scores;

kernel density estimation is used in probability theory to estimate an unknown density function and is a nonparametric method;

step S40, loading the node abnormal scores and the node edge abnormal scores onto a static topological graph of the IT system to obtain a fault propagation graph; the static topological graph is updated periodically by a collection program;

and step S50, performing a random walk on the fault propagation graph using a random walk algorithm, locating the faulty nodes and KPI indexes, completing automatic root cause positioning, storing and displaying the root cause positioning results, and automatically executing the corresponding fault repair operation.

In step S10, the KPI indicators at least include CPU utilization and memory utilization.

The step S20 specifically includes:

The step S30 specifically includes:
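The text does not write out the estimator used in step S30. For reference only, the textbook kernel density estimator over historical samples $x_1, \dots, x_n$ with bandwidth $h$ is shown below, together with one possible logarithmic conversion to a score. The Gaussian kernel $K$ and the negative sign in $s(x)$ are assumptions of this sketch, not statements from the source.

```latex
\hat{f}_h(x) = \frac{1}{n h} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right),
\qquad
K(u) = \frac{1}{\sqrt{2\pi}}\, e^{-u^{2}/2},
\qquad
s(x) = -\log \hat{f}_h(x)
```

Under this convention a value observed in the fault time window that has low density under the historical data receives a high score $s(x)$.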

The step S50 specifically includes:

performing a random walk on the fault propagation graph using a random walk algorithm, at each step randomly visiting a next adjacent node, recording the number of visits to each node and sorting the nodes in descending order of visits, thereby locating the faulty node and its corresponding KPI indexes and completing root cause positioning. The random walk algorithm calculates, for each node, the probabilities of forward, backward and self transitions.
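The source does not give formulas for the forward, backward and self transition probabilities. The sketch below shows one plausible weighting, stated purely as an assumption: forward moves (caller to callee) are weighted by the callee's abnormal score, backward moves by a damped share of the caller's score, and the self move by how much the node's own score exceeds its best alternative. The function name, the `backward_factor` parameter and the sample scores are hypothetical.

```python
def transition_probabilities(calls, scores, backward_factor=0.1):
    """Build per-node transition probabilities with forward, backward and self moves.

    calls:  list of (caller, callee) edges taken from the call chain.
    scores: node -> non-negative abnormal score.
    The weighting rule and `backward_factor` are assumptions of this sketch.
    """
    nodes = {n for edge in calls for n in edge}
    table = {}
    for node in nodes:
        forward = {callee: scores.get(callee, 0.0)
                   for caller, callee in calls if caller == node}
        backward = {caller: backward_factor * scores.get(caller, 0.0)
                    for caller, callee in calls if callee == node}
        # Stay put when this node looks more abnormal than anything it can move to.
        best_move = max([*forward.values(), *backward.values()], default=0.0)
        self_weight = max(0.0, scores.get(node, 0.0) - best_move)
        weights = {**backward, **forward, node: self_weight}
        total = sum(weights.values())
        if total == 0:
            # No anomaly signal nearby: fall back to a uniform choice.
            weights = {k: 1.0 for k in weights}
            total = float(len(weights))
        table[node] = {k: w / total for k, w in weights.items()}
    return table


# Hypothetical call edges and abnormal scores.
calls = [("gateway", "order"), ("order", "payment"), ("order", "inventory")]
scores = {"gateway": 0.0, "order": 1.1, "payment": 7.2, "inventory": 0.3}
table = transition_probabilities(calls, scores)
```

With these sample scores the walker at `order` moves to `payment` most of the time, and once at `payment` it mostly stays, which is what drives the visit count of the suspected root cause upward.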

A preferred embodiment of the root cause positioning system based on a kernel density estimation call chain according to the present invention includes the following modules:

the data acquisition module is used for continuously collecting, at 1-minute intervals, the availability index and KPI indexes of each business service in the IT system and the response time and success rate indexes between business services on each call chain, and storing the collected data in Elasticsearch;

the availability index monitoring module is used for setting a threshold value, monitoring the availability index based on the threshold value and judging whether the IT system has faults or not; triggering root cause location when the availability indicator exceeds a threshold;

the kernel density estimation module is used for converting, based on kernel density estimation, the KPI indexes of the call chain associated with the business service into node abnormal scores, and converting the response time and success rate indexes of that call chain into node edge abnormal scores;

the fault propagation graph building module is used for loading the node abnormal scores and the node edge abnormal scores to a static topological graph of the IT system to obtain a fault propagation graph; the static topological graph is updated regularly through an acquisition program;

and the root cause positioning module is used for performing a random walk on the fault propagation graph using a random walk algorithm, locating the faulty nodes and KPI indexes, completing automatic root cause positioning, storing and displaying the root cause positioning results, and automatically executing the corresponding fault repair operation.

In the data acquisition module, the KPI at least comprises CPU utilization rate and memory utilization rate.

The availability index monitoring module specifically comprises:

The kernel density estimation module specifically comprises:

The root cause positioning module specifically comprises:

In summary, the invention has the advantages that:

Although specific embodiments of the invention have been described above, it will be understood by those skilled in the art that the specific embodiments described are illustrative only and are not limiting upon the scope of the invention, and that equivalent modifications and variations can be made by those skilled in the art without departing from the spirit of the invention, which is to be limited only by the appended claims.

Claims

1. A root cause positioning method based on a kernel density estimation call chain, characterized in that the method comprises the following steps:

2. The root cause positioning method based on a kernel density estimation call chain according to claim 1, characterized in that: in step S10, the KPI indexes include at least CPU utilization and memory utilization.

3. The root cause positioning method based on a kernel density estimation call chain according to claim 1, characterized in that: the step S20 specifically includes:

4. The root cause positioning method based on a kernel density estimation call chain according to claim 1, characterized in that: the step S30 specifically includes:

5. The root cause positioning method based on a kernel density estimation call chain according to claim 1, characterized in that: the step S50 specifically includes:

6. A root cause positioning system based on a kernel density estimation call chain, characterized in that the system comprises the following modules:

7. The root cause positioning system based on a kernel density estimation call chain according to claim 6, characterized in that: in the data acquisition module, the KPI indexes include at least CPU utilization and memory utilization.

8. The root cause positioning system based on a kernel density estimation call chain according to claim 6, characterized in that: the availability index monitoring module specifically comprises:

9. The root cause positioning system based on a kernel density estimation call chain according to claim 6, characterized in that: the kernel density estimation module specifically comprises:

10. The root cause positioning system based on a kernel density estimation call chain according to claim 6, characterized in that: the root cause positioning module specifically comprises: