CN115580528A - Fault root cause positioning method, device, equipment and readable storage medium

Info

Publication number: CN115580528A
Application number: CN202211288807.0A
Authority: CN
Inventors: 王风玲; 李佰典; 高保庆; 崔伟; 梁鹰
Original assignee: Tianyi Digital Life Technology Co Ltd
Current assignee: Tianyi Shilian Technology Co ltd
Priority date: 2022-10-20
Filing date: 2022-10-20
Publication date: 2023-01-06

Abstract

The method comprises: when a service request is abnormal, selecting abnormal project nodes and abnormal association relations from a project topological graph obtained through Pinpoint combing, wherein the project topological graph records the project nodes corresponding to all projects and the association relations among the project nodes, and each association relation is expressed by a directed line segment between two interacting project nodes; combining the abnormal project nodes and the abnormal association relations to form an abnormal topological graph; calculating a root cause value for each project node in the abnormal topological graph based on the Hotspot algorithm; and locating the fault root cause according to the root cause value of each project node. Calculating root cause values allows the fault root cause to be located more accurately, which improves the accuracy of the application. The method and the device can therefore locate the fault root cause efficiently and accurately.

Description

Fault root cause positioning method, device, equipment and readable storage medium

Technical Field

The present application relates to the field of fault detection technologies, and in particular, to a method, an apparatus, a device, and a readable storage medium for locating a fault root cause.

Background

With the development of distributed systems, completing a service request of a client through distributed deployment has become more and more popular. By means of distributed deployment, various service requirements of the client can be received in time and efficiently completed.

However, as distributed deployments multiply, the calling relationships among them become complex. When a fault occurs, operation and maintenance personnel must work through a large amount of alarm information and check the distributed deployments one by one to find the root cause of the fault. Because the calling relationships are complex, this manual investigation involves a heavy workload and takes a long time. How to locate the fault root cause efficiently has therefore become a focus of attention for those skilled in the art.

Disclosure of Invention

In view of the above, the present application provides a method, an apparatus, a device and a readable storage medium for locating a fault root cause, which are used to locate the fault root cause efficiently.

In order to achieve the above object, the following solutions are proposed:

a fault root cause positioning method comprises the following steps:

when a service request is abnormal, selecting more than one abnormal project node and more than one abnormal incidence relation from a project topological graph obtained by combing a distributed link tracking tool Pinpoint, wherein the project topological graph records the project nodes corresponding to all projects and the incidence relation between all the project nodes, and the incidence relation is represented by directed line segments between two interactive project nodes;

combining the project nodes of each anomaly and the incidence relation of each anomaly to form an anomaly topological graph;

calculating a root cause value corresponding to each project node in the abnormal topological graph based on a root cause analysis algorithm Hotspot algorithm;

and positioning the fault root cause according to the root cause value corresponding to each project node.

Optionally, the process of obtaining the project topology map by the distributed link tracking tool Pinpoint includes:

collecting project information of each project based on the Pinpoint, wherein the project information comprises project nodes and association relations required by the project corresponding to the project information;

writing the project information of each project into a distributed storage system HBase;

and combing the information of each item in the HBase to obtain a topological graph of the item.

Optionally, the collecting the project information of each project based on Pinpoint includes:

a Pinpoint probe is arranged on each project node, so that when a project request reaches the project node corresponding to the project request, a request identifier corresponding to the project request generated by the Pinpoint probe in the project node is obtained;

and combing more than one project node required by completing the project request and the association relation required by completing the project request based on the request identifier and the Pinpoint, and combining more than one project node required by completing the project request and the association relation required by completing the project request into project information.

Optionally, the calculating a root cause value corresponding to each item node in the abnormal topological graph based on a root cause analysis algorithm Hotspot algorithm includes:

determining an associated project node corresponding to the project node based on an association relation corresponding to each project node in the abnormal topological graph, and acquiring the current access success rate and the predicted access success rate of the project node and the response success rate of the associated project node corresponding to the project node based on a Hotspot algorithm;

calculating a change success rate corresponding to the project node, wherein the change success rate is the difference between the current access success rate and the predicted access success rate;

calculating a derived value of the associated project node corresponding to the project node by using the predicted access success rate of the project node, the response success rate of the associated project node and the change success rate;

and calculating the root cause value of the project node by using the derived value of the associated project node corresponding to the project node.

Optionally, the calculating a derived score of the associated project node corresponding to the project node by using the predicted access success rate of the project node, the response success rate of the associated project node, and the change success rate includes:

calculating the product between the change success rate and the response success rate of the associated project node;

calculating a ratio between the product and the predicted access success rate;

and determining a derived score of the associated project node corresponding to the project node based on the sum of the ratio and the response success rate of the associated project node.

Optionally, the locating a fault root cause according to the root cause value corresponding to each item node includes:

comparing the root cause values corresponding to the project nodes, and selecting the maximum root cause value;

and taking the project node corresponding to the maximum root cause value as a fault root cause.

Optionally, the combining the item nodes of each exception and the association relationship of each exception to form an exception topological graph includes:

collecting project nodes of all exceptions and association relations of all exceptions;

and performing de-duplication and combination on the collected abnormal project nodes and the association relation of the abnormalities to obtain an abnormal topological graph.

A fault root cause locating device comprising:

the system comprises a selection unit and a processing unit, wherein the selection unit is used for selecting more than one abnormal project node and more than one abnormal incidence relation from a project topological graph obtained by combing a distributed link tracking tool Pinpoint when a service request is abnormal, the project topological graph records the project nodes corresponding to all projects and the incidence relation among all the project nodes, and the incidence relation is represented by directed line segments between two interactive project nodes;

the combination unit is used for combining the project nodes of each exception and the association relation of each exception to form an exception topological graph;

the calculation unit is used for calculating a root cause value corresponding to each project node in the abnormal topological graph on the basis of a root cause analysis algorithm Hotspot algorithm;

and the positioning unit is used for positioning the fault root cause according to the root cause value corresponding to each project node.

A fault root cause location device comprising a memory and a processor;

the memory is used for storing programs;

the processor is configured to execute the program, and implement each step of the above fault root cause locating method.

A readable storage medium, having stored thereon a computer program which, when being executed by a processor, carries out the steps of the method for fault root cause localization as described above.

According to the technical scheme, when a service request is abnormal, more than one abnormal project node and more than one abnormal association relation are selected from a project topological graph obtained by combing through the distributed link tracking tool Pinpoint, so that the abnormal project nodes and abnormal association relations that need to be checked can be obtained more quickly. The project topological graph records the project nodes corresponding to all projects and the association relations among the project nodes, and each association relation is expressed by a directed line segment between two interacting project nodes. The abnormal project nodes and the abnormal association relations are combined to form an abnormal topological graph, so that subsequent troubleshooting of the fault root cause works on the abnormal topological graph and is no longer caught up in complex calling relations, which further improves the efficiency of locating the fault root cause. A root cause value corresponding to each project node in the abnormal topological graph is calculated based on the root cause analysis algorithm Hotspot, and the fault root cause is located according to the root cause value corresponding to each project node, so that the fault root cause can be located more accurately and the accuracy of the application is improved. The method and the device can therefore locate the fault root cause efficiently and accurately.

In addition, the incidence relation among the project nodes can be accurately combed through the Pinpoint, and therefore a more reliable project topological graph covering the project nodes corresponding to the projects and the incidence relation among the project nodes is obtained.

In addition, on the basis of a root cause analysis algorithm Hotspot algorithm, when the root cause value is calculated, the incidence relation among abnormal project nodes in the abnormal topological graph can be concerned, so that the accuracy of the positioned fault root cause is improved.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the embodiments or the prior art descriptions will be briefly described below, it is obvious that the drawings in the following description are only the embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

Fig. 1 is a flowchart of a fault root cause locating method according to an embodiment of the present disclosure;

Fig. 2 is an example project topological graph according to an embodiment of the present application;

Fig. 3 is a block diagram of a fault root cause locating device according to an embodiment of the present disclosure;

Fig. 4 is a hardware structure block diagram of a fault root cause locating device according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The fault root cause positioning method can be applied to the technical field of operation and maintenance, and can be used for efficiently maintaining the fault root cause when a fault occurs by positioning the fault root cause so as to reduce adverse effects caused by the fault. The fault root cause positioning method can be arranged in various system platforms, the bearing hardware of the various system platforms can be various electronic equipment or various electronic terminals, and the execution main body of the method is a processor of the system platform.

Referring to fig. 1, the fault root cause locating method of the present application will be described in detail, which includes the following steps:

S1, selecting more than one abnormal project node and more than one abnormal association relation from a project topological graph obtained through combing by the distributed link tracking tool Pinpoint.

Specifically, when an abnormal condition occurs in the service processing process, the abnormal item node and the association relationship with the data transmission problem may be selected from a preset item topology map according to the fault reminding information, the abnormal information or the data transmission problem occurring in the service processing process.

The preset project topological graph is built in advance: the project nodes corresponding to each project, and the association relations among the project nodes that are needed to complete the project, are collected with Pinpoint and combed; the combed project nodes and association relations are combined into the project topological graph; and the project topological graph is stored so that it can be called when a fault occurs. Further, when a project is updated, the project topological graph can be updated at the same time.

Pinpoint is a distributed link tracking tool, and the incidence relation among all project nodes is combed in a distributed link tracking mode.

The incidence relation is a data transmission relation, and the representation mode in the project topological graph is a directed line segment between two interactive project nodes, as shown in fig. 2. The directed line segment indicates the data interaction mode between the project nodes at the two ends of the directed line segment.

Referring to fig. 2, a to n are project nodes, each directed line segment between project nodes is an association relation, and the directed line segment from a pointing to e indicates that a can transmit data to e during project processing.

And S2, combining the project nodes of each anomaly and the incidence relation of each anomaly to form an anomaly topological graph.

Specifically, the abnormal project nodes and the abnormal association relationship are de-duplicated and combined into an abnormal topological graph, and the fault root is located in the abnormal topological graph.

And S3, calculating a root cause value corresponding to each project node in the abnormal topological graph based on a root cause analysis algorithm Hotspot algorithm.

Specifically, the root cause value corresponding to each project node in the abnormal topological graph is calculated according to the Hotspot algorithm. The root cause value indicates the possibility that the project node is the fault root cause: the higher the root cause value, the higher the possibility that the project node is the fault root cause.

Generally, when a project node is abnormal, the association relations related to that project node are also abnormal. An association relation indicates a data transmission relation between project nodes and cannot exist apart from the project nodes, so fault maintenance can be completed by maintaining the project nodes. Each project node in the abnormal topological graph has a corresponding association relation.

And S4, positioning the fault root cause according to the root cause value corresponding to each project node.

Specifically, there are various ways to locate the fault root cause according to the root cause value corresponding to each project node. For example, the root cause values corresponding to the project nodes may be sorted, and the project node corresponding to the largest root cause value is selected as the fault root cause. Alternatively, a fault fixed value (a threshold) may be set, and the project nodes whose root cause values exceed the fault fixed value are taken as fault root causes. The two modes can also be combined: when even the largest root cause value does not exceed the fault fixed value, the project node corresponding to the largest root cause value is selected as the fault root cause; when more than one root cause value exceeds the fault fixed value, all project nodes whose root cause values exceed the fault fixed value are taken as fault root causes.

Different fault root cause positioning modes can be selected according to different actual conditions.
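The selection modes described above can be sketched as follows. This is only a minimal illustration: the function name `locate_root_causes`, the `threshold` parameter (standing in for the fault fixed value), and the example root cause values are assumptions introduced here, not part of the original method.

```python
from typing import Dict, List, Optional


def locate_root_causes(scores: Dict[str, float],
                       threshold: Optional[float] = None) -> List[str]:
    """Pick fault root causes from per-node root cause values.

    Without a threshold, the node(s) with the largest value are returned.
    With a threshold (the "fault fixed value"), every node whose value
    exceeds it is returned; if none exceeds it, fall back to the largest.
    """
    if not scores:
        return []
    if threshold is not None:
        above = [node for node, value in scores.items() if value > threshold]
        if above:
            return sorted(above)
    best = max(scores.values())
    # Ties on the largest value are all reported (an assumption of this sketch).
    return sorted(node for node, value in scores.items() if value == best)


# Hypothetical root cause values for three abnormal project nodes.
example_scores = {"a": 0.9, "e": 0.4, "f": 0.7}
largest_only = locate_root_causes(example_scores)
combined_mode = locate_root_causes(example_scores, threshold=0.6)
fallback_mode = locate_root_causes(example_scores, threshold=0.95)
```

Here `largest_only` holds only node a, `combined_mode` holds nodes a and f because both exceed 0.6, and `fallback_mode` falls back to node a because no value exceeds 0.95.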

According to the technical scheme, when a service request is abnormal, more than one abnormal project node and more than one abnormal association relation are selected from a project topological graph obtained by combing through the distributed link tracking tool Pinpoint, so that the abnormal project nodes and abnormal association relations that need to be checked can be obtained more quickly. The project topological graph records the project nodes corresponding to all projects and the association relations among the project nodes, and each association relation is represented by a directed line segment between two interacting project nodes. The abnormal project nodes and the abnormal association relations are combined to form an abnormal topological graph, so that subsequent troubleshooting of the fault root cause works on the abnormal topological graph and is no longer caught up in complex calling relations, which further improves the efficiency of locating the fault root cause. A root cause value corresponding to each project node in the abnormal topological graph is calculated based on the root cause analysis algorithm Hotspot, and the fault root cause is located according to the root cause value corresponding to each project node, so that the fault root cause can be located more accurately and the accuracy of the application is improved. The method and the device can therefore locate the fault root cause efficiently and accurately.

In addition, on the basis of a root cause analysis algorithm Hotspot algorithm, when a root cause score is calculated, the incidence relation among abnormal project nodes in the abnormal topological graph can be concerned, so that the accuracy of the positioned fault root cause is improved.

The process of combing the topological graph of items using Pinpoint as mentioned in step S1 will be explained in detail below, with the following steps:

and S10, collecting project information of each project based on the Pinpoint, wherein the project information comprises project nodes and association relations required by the project corresponding to the project information.

Specifically, based on Pinpoint, a distributed link tracking mode is adopted to track the processing process of each project, and the project nodes and the association relation involved in the processing process of each project are obtained, and the project nodes and the association relation involved in the processing process of the project constitute project information corresponding to the project.

And S11, writing the item information of each item into the distributed storage system HBase.

Specifically, the project information of each project may be written into the HBase, where the HBase is a distributed storage system and is a distributed column-oriented open source database.

And S12, combing the information of each item in the HBase to obtain a topological graph of the item.

Specifically, duplication removal is carried out on each item information in the HBase, the association relationship corresponding to each item node is determined, an item topological graph is constructed according to the association relationship of each item node, and the item topological graph is displayed in a WEB page form.

It can be seen from the foregoing technical solutions that the present embodiment provides an optional way of constructing a project topological graph by using Pinpoint. In this way, the project nodes and the association relations related to each project can be combed and recombined to obtain the project topological graph. Through this process, the association relations among the project nodes can be combed more accurately, and therefore a more reliable project topological graph is constructed.

In some embodiments of the present application, a process of collecting project information of each project based on Pinpoint at S10 is described in detail, and the steps are as follows:

S100, deploying a Pinpoint probe on each project node, so that when a project request reaches the project node corresponding to the project request, the request identifier generated for the project request by the Pinpoint probe in that project node is obtained.

Specifically, a project node may be a service that may be used when processing business, such as tomcat, nginx, redis, springboot, and resin.

Here, tomcat is a Servlet container developed by Apache; nginx is a free, open-source, high-performance HTTP server and reverse proxy server; redis (Remote Dictionary Server) is an open-source, log-type database written in ANSI C that supports networking, is memory-based, and can also be persistent; springboot is a framework for integrating frameworks: it brings various frameworks together, makes that integration simpler, and simplifies the templated configuration needed during integration; resin is a very fast engine that includes its own server and can serve both dynamic content and static content well.

Pinpoint probes may be deployed on all project nodes; for example, a Pinpoint probe may be deployed on the tomcat corresponding to each project.

After a Pinpoint probe is deployed on a project node, when a project request reaches the project node, the Pinpoint probe of the project node may generate a request identifier corresponding to the project request. The request identifier may be used to record the trace data produced from the controller as the request is completed, and the trace data is stored in the HBase. The trace data includes the processing result of the project request at the project node and the processing procedure executed to obtain that processing result. The processing result may need to be transmitted to another project node together with the project request, so that the other project node can use the processing result to finally complete the project request; when the processing result reaches the other project node, the Pinpoint probe of that project node also generates a request identifier. In other words, the Pinpoint probes make it possible to track, for each project request arriving at a project node, the processing procedure, the processing result, and the transmission direction of the processing result. The transmission direction indicates the association relation between project nodes, from which the project information is obtained.

S101, combing more than one project node required for completing the project request and the incidence relation required for completing the project request based on the request identification and the Pinpoint, and combining more than one project node required for completing the project request and the incidence relation required for completing the project request into project information.

Specifically, based on the request identifier and Pinpoint, trace data including association relationships and project nodes required for completing each project request is acquired from the Spring MVC controller, the trace data is written into the HBase, the project nodes and association relationships stored in the HBase are deduplicated, and the deduplicated data is combined to obtain the project topology map shown in fig. 2.

The Spring MVC controller is responsible for coordinating and organizing the different components to complete project requests and return responses.
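As a rough illustration of how trace data can be turned into project information, the sketch below groups trace records by request identifier and derives the project nodes and directed association relations for each request. It does not use the Pinpoint or HBase APIs; the `TraceRecord` type, its fields (`request_id`, `node`, `caller`), and the assumption that all hops of one request can be grouped under one identifier are hypothetical simplifications made only for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Optional, Set, Tuple


class TraceRecord(NamedTuple):
    """Hypothetical, simplified trace record (not Pinpoint's real schema)."""
    request_id: str         # identifier used to group the hops of one request
    node: str               # project node that processed this hop
    caller: Optional[str]   # node that transmitted the data; None at the entry


ProjectInfo = Tuple[Set[str], Set[Tuple[str, str]]]


def project_info_by_request(records: Iterable[TraceRecord]) -> Dict[str, ProjectInfo]:
    """Group records by request and collect nodes and directed relations."""
    info: Dict[str, ProjectInfo] = defaultdict(lambda: (set(), set()))
    for rec in records:
        nodes, relations = info[rec.request_id]
        nodes.add(rec.node)
        if rec.caller is not None:
            nodes.add(rec.caller)
            # The transmission direction caller -> node is the association relation.
            relations.add((rec.caller, rec.node))
    return dict(info)


example_records = [
    TraceRecord("r1", "a", None),
    TraceRecord("r1", "e", "a"),
    TraceRecord("r1", "e", "a"),   # duplicate hop, removed by the sets
    TraceRecord("r2", "f", None),
]
example_info = project_info_by_request(example_records)
```

Using sets for both nodes and relations means repeated hops are deduplicated as they are collected, which mirrors the deduplication step described for the data stored in the HBase.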

As can be seen from the foregoing technical solutions, the present embodiment provides an alternative way for acquiring project information, and the above way can further use the request identifier and Pinpoint generated by the Pinpoint probe to track the processing process and processing result when each project request reaches the corresponding project node, and determine the association relationship between the project nodes, so that the association relationship between the project nodes can be more easily identified, and the project node required for completing the project request can be more easily determined.

In some embodiments of the present application, a detailed description is given of the process of combining the item nodes of each anomaly and the association relationship of each anomaly to form the anomaly topology map in step S2, and the steps are as follows:

and S20, collecting the project nodes of the exceptions and the association relations of the exceptions.

Specifically, the abnormal project nodes and the abnormal association relations are determined according to the reminding information displayed when the fault occurs, and all project nodes and association relations mentioned in the reminding information are collected.

And S21, removing duplication of the collected abnormal project nodes and the incidence relation of the abnormalities and combining the collected abnormal project nodes and the incidence relation of the abnormalities to obtain an abnormal topological graph.

Specifically, the collected abnormal project nodes and the association relations of the abnormalities are cleaned, and the cleaned results are combined into a topological graph to obtain an abnormal topological graph.
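The deduplicate-and-combine step can be sketched as building an adjacency map from the collected abnormal project nodes and abnormal association relations. The function name `build_abnormal_graph` and the choice to also register the endpoints of every abnormal relation as nodes are assumptions of this sketch rather than details given in the text.

```python
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[str, str]  # (source node, target node) of a directed relation


def build_abnormal_graph(nodes: Iterable[str],
                         edges: Iterable[Edge]) -> Dict[str, Set[str]]:
    """Deduplicate abnormal nodes and relations into one adjacency map."""
    graph: Dict[str, Set[str]] = {}
    for node in nodes:
        graph.setdefault(node, set())       # repeated nodes collapse here
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)  # repeated edges collapse here
        graph.setdefault(dst, set())        # assumption: endpoints become nodes
    return graph


# Hypothetical alarm data mentioning some nodes and relations more than once.
abnormal_graph = build_abnormal_graph(
    ["a", "e", "a"],
    [("a", "e"), ("a", "e"), ("a", "f")],
)
```

The resulting map lists, for each abnormal project node, the nodes it transmits data to, which is the form the later root cause calculation needs.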

According to the technical scheme, the embodiment provides an optional mode for obtaining the abnormal topological graph, and through the mode, each item node and each association relation which need to be concerned can be determined according to the fault condition, and each item node and association relation which need to be concerned are combined into the abnormal topological graph, so that the fault root factor can be determined from the abnormal topological graph, and the efficiency of determining the fault root factor is further improved.

In some embodiments of the present application, a detailed description is given to the process of calculating a root cause score corresponding to each item node in the abnormal topological graph based on a root cause analysis algorithm Hotspot algorithm in step S3, and the steps are as follows:

S30, determining the associated project node corresponding to the project node based on the association relationship corresponding to each project node in the abnormal topological graph, and acquiring the current access success rate and the predicted access success rate of the project node and the response success rate of the associated project node corresponding to the project node based on a Hotspot algorithm.

Specifically, the association relationship corresponding to the project node in the abnormal topology map may be read, the project node to which the project node needs to transmit data in the abnormal topology map is determined, and the project node that needs to receive the data transmitted by the project node is the associated project node of the project node, referring to fig. 2, if the project node a, the project node e, the project node f, and the project node g all exist in the abnormal topology map, the associated project node of the project node a is the project node e, the project node f, and the project node g.
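A minimal sketch of this lookup, assuming the abnormal topological graph is available as a list of directed (source, target) pairs; the function name `associated_nodes` and the extra edge from e to h are hypothetical and used only to show that unrelated edges are ignored.

```python
from typing import Iterable, Set, Tuple


def associated_nodes(edges: Iterable[Tuple[str, str]], node: str) -> Set[str]:
    """Return the nodes that receive data transmitted by `node`."""
    return {dst for src, dst in edges if src == node}


# Edges a->e, a->f, a->g follow the example in the text; e->h is made up.
abnormal_edges = [("a", "e"), ("a", "f"), ("a", "g"), ("e", "h")]
associated_of_a = associated_nodes(abnormal_edges, "a")
```

For project node a this yields project nodes e, f and g, matching the example above.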

The access success rate is the probability that a project node provides a correct response when accessed. The current access success rate may be the probability that the project node provides a correct response when accessed at the current time. The predicted access success rate may be determined from historical success rates; for example, the average success rate of the previous month may be taken as the predicted access success rate. The response success rate is the probability that an associated project node provides a correct response when the project node transmits data to it.

The success rate is the finest-granularity KPI, which is calculated by combining the page views (PV), the dial-testing success rate, the port-detection success rate, the PING command success rate, the number of natural persons browsing, and the like.

KPI is PV, dial testing success rate, port testing success rate, PING success rate.

On this basis, the average success rate of the project node over the previous month, the access success rate of the project node at the current time, and the response success rate when the project node transmits data to each associated project node can be determined based on the Hotspot algorithm.

S31, calculating a change success rate corresponding to the project node, wherein the change success rate is the difference between the current access success rate and the predicted access success rate.

Specifically, the change success rate indicates how much the success rate of the project node has changed: the difference between the current access success rate and the predicted access success rate of the project node is taken as the change success rate corresponding to the project node.
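The two quantities can be sketched as below, assuming the predicted access success rate is taken as the plain average of historical success rates (the text gives the previous month's average as one example). The function names and the sample figures are illustrative assumptions.

```python
from statistics import mean
from typing import Sequence


def predicted_success_rate(history: Sequence[float]) -> float:
    """Predicted access success rate as the average of historical rates."""
    return mean(history)


def change_success_rate(current: float, predicted: float) -> float:
    """Change success rate: current minus predicted access success rate."""
    return current - predicted


# Hypothetical history and current measurement for one project node.
predicted = predicted_success_rate([1.0, 0.5])
change = change_success_rate(0.5, predicted)
```

With this sign convention a negative change success rate means the project node currently succeeds less often than predicted.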

And S32, calculating a derived value of the associated project node corresponding to the project node by using the predicted access success rate of the project node, the response success rate of the associated project node and the change success rate.

Specifically, when examining the possibility that a project node is the fault root cause, the change success rate of the project node can be propagated to all of its associated project nodes through a chain reaction, yielding the derived value of each associated project node corresponding to the project node. For example, since the associated project nodes of project node a are project node e, project node f and project node g, the derived values of project node e, project node f and project node g corresponding to project node a can be calculated through the chain reaction.

The predicted access success rate, the response success rate of each associated project node and the change success rate can be processed to determine the proportion of the change success rate attributable to each associated project node, and the derived value of each associated project node is determined according to that proportion.

S33, calculating the root cause score of the project node by using the derived scores of the associated project nodes corresponding to the project node.

Specifically, a root cause score may be calculated from the derived score of each associated project node corresponding to the project node, and the maximum of these root cause scores may be selected as the root cause score of the project node.

The process of calculating the root cause score using a derived score is as follows:

calculating a first Euclidean distance between the current access success rate of the project node and the derived score of any associated project node;

calculating a second Euclidean distance between the current access success rate of the project node and the predicted access success rate of the associated project node;

calculating the ratio of the first Euclidean distance to the second Euclidean distance;

calculating the difference between 1 and the ratio, and selecting the larger of 0 and the difference as the root cause score.

Through the above process, a root cause score of the project node with respect to each associated project node can be calculated from the derived score of that associated project node. When the project node has multiple associated project nodes, the maximum of these root cause scores is selected as the final root cause score of the project node.

Alternatively, the maximum derived score may be selected from the multiple derived scores corresponding to the project node as the final derived score of the project node, and the root cause score of the project node is then calculated from this final derived score through the same process.
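The root cause score calculation described above can be sketched as follows. This is an illustrative sketch rather than the applicant's implementation: the function names and the sample values are hypothetical, the Euclidean distance between two scalar rates is taken as their absolute difference, and returning 0 when the second distance is zero is an assumption, since the text does not say how to handle that case.

```python
from typing import Iterable, Tuple


def root_cause_score(current: float, derived: float, predicted_assoc: float) -> float:
    """Score of a project node with respect to one associated project node."""
    first = abs(current - derived)            # first Euclidean distance
    second = abs(current - predicted_assoc)   # second Euclidean distance
    if second == 0:
        # Assumption: the text does not cover a zero denominator.
        return 0.0
    return max(0.0, 1.0 - first / second)


def node_root_cause_score(current: float,
                          associated: Iterable[Tuple[float, float]]) -> float:
    """Maximum score over all (derived score, predicted rate) pairs."""
    return max((root_cause_score(current, d, p) for d, p in associated),
               default=0.0)


# Hypothetical node with current rate 0.4 and two associated project nodes.
score = node_root_cause_score(0.4, [(0.25, 0.5), (0.45, 0.9)])
print(round(score, 3))
```

In this example the first associated node yields a ratio greater than 1 and is clamped to 0, while the second yields a score close to 0.9, which becomes the final root cause score of the node.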

It can be seen from the foregoing technical solution that this embodiment provides an optional way of calculating the root cause score corresponding to each project node. In this way, the association relations, current access success rate, predicted access success rate, response success rate and so on of each project node in the abnormal topological graph can be considered together to calculate the likelihood that the project node is the fault root cause.

In some embodiments of the present application, the process in step S32 of calculating the derived score of the associated project node corresponding to the project node by using the predicted access success rate of the project node, the response success rate of the associated project node and the change success rate is described in detail. The specific steps are as follows:

S320, calculating the product of the change success rate and the response success rate of the associated project node.

Specifically, the change success rate may be multiplied by the response success rate of the associated project node to obtain a product.

S321, calculating a ratio between the product and the predicted access success rate.

Specifically, a ratio between the product and the predicted access success rate of the project node may be calculated.

S322, determining the derived score of the associated project node corresponding to the project node based on the sum of the ratio and the response success rate of the associated project node.

Specifically, the ratio may be added to the response success rate of the associated project node, and the sum is used as the derived score of the associated project node.

It can be seen from the foregoing technical solution that this embodiment provides an optional way of calculating the derived score of each associated project node. In this way, the change success rate of each project node is distributed among its associated project nodes in proportion, so as to determine the derived score corresponding to each association relation. Because the change success rate, the response success rate and the predicted access success rate are all taken into account, the derived score can be obtained more accurately, which further improves the reliability of the method.
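Steps S320 to S322 can be sketched as a single function. The function name `derived_score`, the zero check and the sample values are assumptions added for illustration; the arithmetic follows the three steps as written.

```python
def derived_score(predicted: float, current: float, response: float) -> float:
    """Derived score of one associated project node (steps S320 to S322).

    predicted, current: access success rates of the project node itself.
    response: response success rate of the associated project node.
    """
    if predicted == 0:
        # Assumption: the text does not cover a zero predicted rate.
        raise ValueError("predicted access success rate must be non-zero")
    change = current - predicted      # S31: change success rate
    product = change * response       # S320
    ratio = product / predicted       # S321
    return ratio + response           # S322


# Hypothetical values: the node dropped from 0.8 to 0.4, and the
# associated project node has a response success rate of 0.5.
print(derived_score(predicted=0.8, current=0.4, response=0.5))
```

Algebraically the result equals response × current / predicted, so each associated project node's response success rate is scaled by the same factor by which the project node's own success rate changed.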

In some embodiments of the present application, the process in step S4 of positioning the fault root cause according to the root cause score corresponding to each project node is described in detail. The steps are as follows:

S40, comparing the root cause scores corresponding to the project nodes, and selecting the maximum root cause score.

Specifically, the root cause scores are sorted in descending order, so the first-ranked score is the maximum root cause score. This score is selected and may then be removed from the ranking.

S41, taking the project node corresponding to the maximum root cause score as the fault root cause.

Specifically, the project node corresponding to the maximum root cause score, together with the associated project node whose derived score was used in calculating that maximum root cause score, can be taken as the fault root cause; maintenance is then performed on the fault root cause and the project request is re-initiated.

If the re-initiated project request still fails, the above is repeated: the root cause score now ranked first is selected as the maximum root cause score and removed from the ranking, the project node corresponding to it and the associated project node whose derived score was used in calculating it are taken as the fault root cause and maintained, and the project request is re-initiated, until the project request no longer fails.
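The iterative positioning in steps S40 and S41 can be sketched as a loop over the ranked root cause scores. The callback `still_failing` is a hypothetical stand-in for "perform maintenance on this candidate, re-initiate the project request, and report whether it still fails"; for brevity the sketch tracks only the project node and omits the associated project node that the text also treats as part of the fault root cause.

```python
from typing import Callable, Dict, List


def locate_root_causes(scores: Dict[str, float],
                       still_failing: Callable[[str], bool]) -> List[str]:
    """Try candidates in descending score order until the request succeeds."""
    # S40: sort the root cause scores in descending order.
    ranking = sorted(scores, key=lambda node: scores[node], reverse=True)
    tried: List[str] = []
    for node in ranking:
        # S41: take the first-ranked remaining node as the fault root cause.
        tried.append(node)
        if not still_failing(node):
            break  # the re-initiated request no longer fails
    return tried


# Hypothetical scores; in this example the real fault is in node "f".
scores = {"a": 0.9, "e": 0.3, "f": 0.7}
candidates = locate_root_causes(scores, still_failing=lambda node: node != "f")
print(candidates)
```

Iterating over the sorted list has the same effect as repeatedly selecting the first-ranked score and removing it from the ranking.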

It can be seen from the foregoing technical solution that this embodiment provides an optional way of positioning the fault root cause: the root cause scores are compared and the fault root cause is positioned according to the comparison result, which makes the fault root cause easier to locate.

The following describes the fault root cause positioning device provided in the embodiment of the present application, and the fault root cause positioning device described below and the fault root cause positioning method described above may be referred to in correspondence with each other.

Referring to fig. 3, fig. 3 is a schematic structural diagram of a fault root cause positioning device according to an embodiment of the present disclosure.

As shown in fig. 3, the fault root cause locating device may include:

a selecting unit 1, configured to select, when a service request is processed abnormally, more than one abnormal project node and more than one abnormal association relation from a project topological graph obtained by combing with the distributed link tracing tool Pinpoint, wherein the project topological graph records the project nodes corresponding to all projects and the association relations among the project nodes, and each association relation is represented by a directed line segment between two interacting project nodes;

a combination unit 2, configured to combine the abnormal project nodes and the abnormal association relations to form an abnormal topological graph;

a calculating unit 3, configured to calculate a root cause score corresponding to each project node in the abnormal topological graph based on the root cause analysis algorithm Hotspot;

and a positioning unit 4, configured to position the fault root cause according to the root cause score corresponding to each project node.

Optionally, the selecting unit may include:

an information collection unit, configured to collect project information of each project based on Pinpoint, wherein the project information includes the project nodes and association relations required by the corresponding project;

an information writing unit, configured to write the project information of each project into the distributed storage system HBase;

and an information combing unit, configured to comb the project information of each project in HBase to obtain the project topological graph.

Optionally, the information collecting unit may include:

a first information collection unit, configured to deploy a Pinpoint probe on each project node, so that when a project request reaches its corresponding project node, the request identifier generated for the project request by the Pinpoint probe in that project node is obtained;

and a second information collection unit, configured to comb out, based on the request identifier and Pinpoint, the more than one project node and the association relations required for completing the project request, and to combine them into the project information.
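How project information might be assembled from a request identifier can be sketched as follows. The sketch does not use any Pinpoint API: the input format, a list of `(request_id, caller, callee)` tuples, and all names are hypothetical and only illustrate grouping call records by request identifier into project nodes and directed association relations.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def project_info_by_request(
    calls: Iterable[Tuple[str, str, str]]
) -> Dict[str, Dict[str, set]]:
    """Group (request_id, caller, callee) records into per-request project info."""
    info = defaultdict(lambda: {"nodes": set(), "edges": set()})
    for request_id, caller, callee in calls:
        info[request_id]["nodes"].update((caller, callee))
        # A directed association relation from caller to callee.
        info[request_id]["edges"].add((caller, callee))
    return dict(info)


# Hypothetical call records for two project requests.
calls = [
    ("req-1", "a", "e"),
    ("req-1", "a", "f"),
    ("req-2", "b", "f"),
    ("req-1", "a", "e"),  # repeated call, stored once
]
info = project_info_by_request(calls)
print(sorted(info["req-1"]["edges"]))
```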

Optionally, the calculating unit may include:

a success rate determining unit, configured to determine an associated project node corresponding to the project node based on an association relationship corresponding to each project node in the abnormal topological graph, and obtain a current access success rate and a predicted access success rate of the project node and a response success rate of the associated project node corresponding to the project node based on a Hotspot algorithm;

a difference value calculating unit, configured to calculate a change success rate corresponding to the project node, where the change success rate is a difference between the current access success rate and the predicted access success rate;

a derived score calculation unit for calculating a derived score of the associated project node corresponding to the project node using the predicted access success rate of the project node, the response success rate of the associated project node, and the change success rate;

and the root score calculating unit is used for calculating the root score of the project node by using the derived score of the associated project node corresponding to the project node.

Optionally, the derived score calculating unit may include:

a product calculation unit for calculating a product between the change success rate and the response success rate of the associated project node;

a ratio calculation unit for calculating a ratio between the product and the predicted access success rate;

and a ratio adding unit, configured to determine the derived score of the associated project node corresponding to the project node based on the sum of the ratio and the response success rate of the associated project node.

Optionally, the positioning unit may include:

a maximum root cause score selecting unit, configured to compare the root cause scores corresponding to the project nodes and select the maximum root cause score;

and a fault root cause positioning unit, configured to take the project node corresponding to the maximum root cause score as the fault root cause.

Optionally, the combination unit may include:

a first combination unit, configured to collect the abnormal project nodes and the abnormal association relations;

and a second combination unit, configured to de-duplicate and combine the collected abnormal project nodes and abnormal association relations to obtain the abnormal topological graph.
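The de-duplication and combination performed by the combination unit can be sketched with sets and an adjacency map. The representation of the abnormal topological graph as a dictionary from node to successor set is an assumption, as is the choice to add the endpoints of every abnormal association relation to the node set; all names are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple


def build_abnormal_graph(
    abnormal_nodes: Iterable[str],
    abnormal_edges: Iterable[Tuple[str, str]],
) -> Dict[str, Set[str]]:
    """De-duplicate abnormal nodes and directed edges, then combine them."""
    nodes = set(abnormal_nodes)   # duplicates collapse automatically
    edges = set(abnormal_edges)
    # Assumption: both endpoints of an abnormal relation belong to the graph.
    for src, dst in edges:
        nodes.update((src, dst))
    graph: Dict[str, Set[str]] = {node: set() for node in nodes}
    for src, dst in edges:
        graph[src].add(dst)       # directed line segment src -> dst
    return graph


# Hypothetical collected anomalies containing duplicates.
graph = build_abnormal_graph(
    ["a", "e", "a"],
    [("a", "e"), ("a", "f"), ("a", "e")],
)
print({node: sorted(succ) for node, succ in sorted(graph.items())})
```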

The fault root cause positioning device provided by the embodiment of the present application can be applied to fault root cause positioning equipment, such as a PC terminal, a cloud platform, a server cluster and the like. Optionally, fig. 4 shows a block diagram of the hardware structure of the fault root cause positioning equipment. Referring to fig. 4, the hardware structure of the fault root cause positioning equipment may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;

in the embodiment of the application, the number of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 is at least one, and the processor 1, the communication interface 2 and the memory 3 complete mutual communication through the communication bus 4;

the processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application;

the memory 3 may include a high-speed RAM, and may further include a non-volatile memory, such as at least one disk memory;

wherein the memory stores a program and the processor can call the program stored in the memory, the program for:

when a service request is abnormal, selecting more than one abnormal project node and more than one abnormal association relationship from a project topological graph obtained by combing a distributed link tracking tool Pinpoint, wherein the project topological graph records the project nodes corresponding to all projects and the association relationship between all the project nodes, and the association relationship is represented by directed line segments between two interactive project nodes;

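combining the abnormal project nodes and the abnormal association relations to form an abnormal topological graph;

calculating a root cause score corresponding to each project node in the abnormal topological graph based on the root cause analysis algorithm Hotspot;

and positioning the fault root cause according to the root cause score corresponding to each project node.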

Optionally, for the detailed functions and extended functions of the program, reference may be made to the above description.

Embodiments of the present application further provide a readable storage medium, which may store a program suitable for being executed by a processor, where the program is configured to implement the steps of the fault root cause positioning method described above.

Finally, it should also be noted that, in this document, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional like elements in a process, method, article, or apparatus that comprises the element.

The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. The various embodiments of the present application may be combined with each other. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A fault root cause positioning method is characterized by comprising the following steps:

2. The method for locating the fault root cause according to claim 1, wherein the process of obtaining the project topological graph by combing with the distributed link tracing tool Pinpoint comprises:

3. The method for locating the fault root cause according to claim 2, wherein the collecting project information of each project based on Pinpoint comprises:

a Pinpoint probe is laid on each project node, so that when a project request reaches the project node corresponding to the project request, a request identifier corresponding to the project request generated by the Pinpoint probe in the project node is obtained;

4. The method for locating the root cause of the fault according to claim 1, wherein the calculating the root cause score corresponding to each project node in the abnormal topology map based on a root cause analysis algorithm Hotspot algorithm comprises:

calculating a derived score of an associated project node corresponding to the project node by using the predicted access success rate of the project node, the response success rate of the associated project node and the change success rate;

5. The method according to claim 4, wherein the calculating a derived score of an associated project node corresponding to the project node using the predicted access success rate of the project node, the response success rate of the associated project node, and the change success rate includes:

calculating a ratio between the product and the predicted access success rate;

6. The method according to claim 5, wherein the locating a fault root according to the root score value corresponding to each project node comprises:

comparing the root cause scores corresponding to the project nodes, and selecting the maximum root cause score;

7. The method according to claim 1, wherein the combining the item nodes of each anomaly and the association relationship of each anomaly to form an anomaly topology map comprises:

and removing duplication and combining the collected abnormal project nodes and the association relation of the abnormalities to obtain an abnormal topological graph.

8. A fault root cause locating device, comprising:

the combination unit is used for combining the abnormal project nodes and the abnormal incidence relations to form an abnormal topological graph;

the calculation unit is used for calculating a root cause score corresponding to each project node in the abnormal topological graph on the basis of a root cause analysis algorithm Hotspot algorithm;

9. A fault root cause locating device comprising a memory and a processor;

the memory is used for storing programs;

the processor, configured to execute the program, and implement the steps of the fault root cause locating method according to any one of claims 1 to 7.

10. A readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method for fault root cause localization according to any of claims 1-7.