CN114528175A - Micro-service application system root cause positioning method, device, medium and equipment - Google Patents

Micro-service application system root cause positioning method, device, medium and equipment

Info

Publication number
CN114528175A
Authority
CN
China
Prior art keywords
abnormal
micro
service
root cause
nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011195383.4A
Other languages
Chinese (zh)
Inventor
朱诗逸
欧阳晔
王云鹏
孟祥德
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Asiainfo Technologies China Inc
Original Assignee
Asiainfo Technologies China Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Asiainfo Technologies China Inc

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/302 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a software system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3024 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a central processing unit [CPU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3409 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment
    • G06F11/3419 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment by assessing time
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/3668 Software testing
    • G06F11/3672 Test management
    • G06F11/3692 Test management for test results analysis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0677 Localisation of faults

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Quality & Reliability (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Hardware Design (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The embodiments of the present application provide a method, a device, a medium and equipment for locating the root cause of faults in a micro-service application system, applied in the field of information technology. The method comprises: performing anomaly detection on the micro-service application system to obtain an abnormal micro-service and the abnormal time corresponding to the abnormal micro-service; then examining the call chains of the abnormal micro-service and the corresponding micro-service nodes, and locating the fault according to the abnormal time and the calling logic of the call chains to obtain the micro-service root cause node. The method and device locate faults in the micro-service application system quickly and accurately, and address the heavy manual workload, low efficiency and limited applicability of fault location in the prior art.

Description

Micro-service application system root cause positioning method, device, medium and equipment
Technical Field
The present application relates to the field of information technology, and in particular, to a method, an apparatus, a medium, and a device for locating a root cause of a micro service application system.
Background
With the spread of cloud computing and big data technologies and the rapid development of distributed technologies, more and more internet and software enterprises are building micro-service architectures. Under a micro-service architecture, a large, complete and complex application can be split into a number of micro-services that are highly cohesive, loosely coupled and stateless. Each micro-service is built around one business function, can be deployed independently through an automated deployment mechanism, and runs in its own process. Typically, each micro-service deployment contains several different micro-service nodes, and the calling relationships between these nodes form call links. In a complete call chain, a delay or anomaly at any node may make the final result abnormal, so the stability of the micro-services is crucial, and a fault must be located quickly and accurately when an anomaly occurs.
Existing schemes for locating the root cause of a micro-service fault generally rely on expert experience or manually labeled data to locate the root cause node, so labor costs are high and both the efficiency and the accuracy of root cause node location are low.
Disclosure of Invention
The application provides a method, device, medium and equipment for locating the root cause in a micro-service application system, to address the high labor cost, low efficiency and limited applicability of existing micro-service fault location.
In a first aspect, the present application provides a method for locating a root cause of a micro-service application system, including:
performing anomaly detection on the micro-service application system to obtain an abnormal micro-service and an abnormal time corresponding to the abnormal micro-service;
and examining the call chains of the abnormal micro-service and the corresponding micro-service nodes, and performing fault location according to the abnormal time and the calling logic of the call chains to obtain the micro-service root cause node.
In an embodiment of the present application, after the micro-service root cause node is obtained by examining the call chains of the abnormal micro-service and the corresponding micro-service nodes and performing fault location according to the abnormal time and the calling logic of the call chains, the method further includes:
acquiring, according to the micro-service root cause node and the abnormal time, the platform indexes of the corresponding micro-service, performing anomaly detection analysis on them, and determining the root cause index.
In an embodiment of the present application, performing anomaly detection on the micro-service application system to obtain an abnormal micro-service and an abnormal time corresponding to the abnormal micro-service includes:
performing gold index anomaly detection on the micro-service application system by an absolute threshold detection method to obtain a fault type;
and determining the abnormal micro-service and the abnormal time corresponding to the abnormal micro-service according to the fault type.
In an embodiment of the present application, performing gold index anomaly detection on the micro-service application system by an absolute threshold detection method to obtain a fault type includes:
detecting the average delay and the success rate of the micro-service requests of the micro-service application system;
if the average delay is not less than a preset average delay threshold and the success rate is not less than a preset success rate threshold, the fault type is a time-consuming anomaly;
if the average delay is less than the average delay threshold and the success rate is less than the success rate threshold but not zero, the fault type is a success rate anomaly;
if the average delay is less than the average delay threshold and the success rate is zero, the fault type is a database anomaly;
if the average delay is not less than the average delay threshold and the success rate is less than the success rate threshold but not zero, the fault type is a simultaneous time-consuming and success rate anomaly;
and if the average delay is not less than the average delay threshold and the success rate is zero, the fault type is a simultaneous time-consuming and database anomaly.
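The threshold rules above can be sketched as a small decision function. This is only an illustration under stated assumptions: the names `FaultType` and `classify_fault` are hypothetical, the case of a zero success rate with normal delay is read as a database anomaly by analogy with the last rule, and the case where both measurements are within their thresholds is treated as "no anomaly", which the text does not list explicitly.

```python
from enum import Enum


class FaultType(Enum):
    """Hypothetical labels for the fault types named in the text."""
    NONE = "no anomaly"  # assumption: not listed in the text
    LATENCY = "time-consuming anomaly"
    SUCCESS_RATE = "success rate anomaly"
    DATABASE = "database anomaly"
    LATENCY_AND_SUCCESS_RATE = "time-consuming and success rate anomaly"
    LATENCY_AND_DATABASE = "time-consuming and database anomaly"


def classify_fault(avg_delay, success_rate, delay_threshold, success_threshold):
    """Map average delay and success rate to a fault type with absolute thresholds."""
    slow = avg_delay >= delay_threshold  # "not less than" the delay threshold
    if success_rate == 0:
        # Zero success rate is read as a database anomaly (assumption).
        return FaultType.LATENCY_AND_DATABASE if slow else FaultType.DATABASE
    if success_rate < success_threshold:
        return FaultType.LATENCY_AND_SUCCESS_RATE if slow else FaultType.SUCCESS_RATE
    return FaultType.LATENCY if slow else FaultType.NONE
```

For example, `classify_fault(120, 0.99, 100, 0.95)` yields the time-consuming anomaly, because only the delay rule fires.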
In an embodiment of the present application, examining the call chains of the abnormal micro-service and the corresponding micro-service nodes, and performing fault location according to the abnormal time and the calling logic of the call chains to obtain the micro-service root cause node, includes:
screening the call chains of the abnormal micro-service to obtain abnormal call chains;
performing anomaly analysis on the nodes of the abnormal call chains with a search algorithm, according to the calling logic of the abnormal call chains, to obtain abnormal nodes and undetermined nodes, where the abnormal nodes and the undetermined nodes are the alternative root cause nodes of the micro-service;
and performing location analysis on the alternative root cause nodes with a call chain positioning algorithm to obtain the micro-service root cause node.
In an embodiment of the present application, screening the call chains of the abnormal micro-service to obtain abnormal call chains includes:
compiling, with a normal distribution, statistics on the time consumed by the call chain in each micro-service request of the abnormal micro-service;
screening the abnormal time-consuming data out of the call chain time consumption statistics according to the abnormal time;
and determining the call chains corresponding to the abnormal time-consuming data as the abnormal call chains.
In an embodiment of the present application, screening the abnormal time-consuming data out of the call chain time consumption statistics according to the abnormal time includes:
taking the data in the call chain time consumption statistics that fall within a preset time period before the abnormal time as the data to be screened;
and screening the data to be screened against the confidence interval of the normal distribution statistics at a preset confidence level to obtain the abnormal time-consuming data.
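The screening step can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `screen_abnormal_chains`, the record layout and the 95% default are assumptions, and the sketch assumes that a duration counts as abnormal when it falls outside the two-sided confidence interval of the fitted normal distribution.

```python
from statistics import NormalDist, fmean, stdev


def screen_abnormal_chains(records, anomaly_time, window, confidence=0.95):
    """Return ids of call chains whose duration is abnormal.

    records: list of (timestamp, chain_id, duration) tuples (assumed layout).
    Only records in [anomaly_time - window, anomaly_time] are screened.
    """
    durations = [d for _, _, d in records]
    if len(durations) < 2:
        return []  # not enough data to estimate a spread
    sigma = stdev(durations)
    if sigma == 0:
        return []  # all durations identical, nothing stands out
    dist = NormalDist(fmean(durations), sigma)
    tail = (1 - confidence) / 2
    low, high = dist.inv_cdf(tail), dist.inv_cdf(1 - tail)
    # Data to be screened: the preset period before the abnormal time.
    to_screen = [r for r in records if anomaly_time - window <= r[0] <= anomaly_time]
    # Assumption: values outside the interval are the abnormal ones.
    return [chain_id for _, chain_id, d in to_screen if d < low or d > high]
```

Fitting the distribution on all collected durations and screening only the recent window keeps the baseline broad while limiting the result to chains close to the abnormal time.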
In an embodiment of the present application, performing anomaly analysis on the nodes of the abnormal call chains with a search algorithm, according to the calling logic of the abnormal call chains, to obtain abnormal nodes and undetermined nodes that are the alternative root cause nodes of the micro-service, includes:
compiling, with a normal distribution, statistics on the call time consumption of the nodes of the abnormal call chains;
and performing a layer-by-layer drill-down search on the nodes of the abnormal call chains with a pruning algorithm, according to the calling logic of the abnormal call chains and the call time consumption of the nodes, to obtain abnormal nodes and undetermined nodes, where the abnormal nodes and the undetermined nodes are the alternative root cause nodes of the micro-service.
In an embodiment of the present application, performing location analysis on the alternative root cause nodes with a call chain positioning algorithm to obtain the micro-service root cause node includes:
counting the number of failures and the failure rate of each alternative root cause node over multiple calls;
and analyzing the alternative root cause nodes layer by layer according to their depth, number of failures and failure rate to determine the root cause node.
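One possible reading of this layer-by-layer analysis is sketched below. The text does not state how depth, failure count and failure rate are combined, so everything here is an assumption made for illustration: the `Candidate` type, the deepest-layer-first order, the two thresholds and the tie-breaking rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one alternative root cause node."""
    node: str
    depth: int      # position in the call chain, larger means further downstream
    calls: int      # number of calls observed
    failures: int   # number of failed calls

    @property
    def failure_rate(self):
        return self.failures / self.calls if self.calls else 0.0


def pick_root_cause(candidates, min_failures=1, min_rate=0.5):
    """Scan layers from deepest to shallowest and return the first qualifying node.

    Assumption: the deepest layer with a sufficiently failing node holds the
    root cause; within a layer, more failures (then a higher rate) wins.
    """
    for depth in sorted({c.depth for c in candidates}, reverse=True):
        layer = [
            c for c in candidates
            if c.depth == depth
            and c.failures >= min_failures
            and c.failure_rate >= min_rate
        ]
        if layer:
            return max(layer, key=lambda c: (c.failures, c.failure_rate)).node
    return None  # no candidate met the thresholds
```

With candidates at depths 1, 2 and 3 where only the first two fail often, the sketch returns the depth-2 node, since the deepest node does not meet the rate threshold.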
In an embodiment of the present application, acquiring the platform indexes of the corresponding micro-service according to the micro-service root cause node and the abnormal time, performing anomaly detection analysis and determining the root cause index includes:
detecting, within a preset time, the platform indexes of the component corresponding to the micro-service root cause node;
and performing anomaly detection analysis on the platform indexes by combining multiple anomaly detection methods, according to the platform indexes and the abnormal time, to obtain the root cause index.
In an embodiment of the present application, performing anomaly detection analysis on the platform indexes by combining multiple anomaly detection methods, according to the platform indexes and the abnormal time, to obtain the root cause index includes:
if a platform index is checked with both a deviation analysis method and a fluctuation point analysis method and both results are abnormal, taking the platform index as a root cause index;
if a platform index is checked with both a deviation analysis method and a fluctuation point analysis method and only one result is abnormal, and a further check with an absolute threshold method is abnormal, taking the platform index as a root cause index;
if a platform index is checked with both a deviation analysis method and a fluctuation point analysis method and both results are normal, taking the platform index as a normal index;
and if a platform index is checked with both a deviation analysis method and a fluctuation point analysis method and only one result is abnormal, and a further check with an absolute threshold method is normal, taking the platform index as a normal index.
In an embodiment of the present application, the platform indexes include one or more of CPU utilization, memory utilization, number of database connections and network send/receive queue length.
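The combination of detectors can be sketched as a small voting function. This is an illustration only, assuming the four rules are read as follows: two abnormal results mark a root cause index, two normal results mark a normal index, and a split result is settled by the absolute threshold check. The function name `is_root_cause_index` and the boolean inputs are hypothetical; the detectors themselves are not shown.

```python
def is_root_cause_index(deviation_abnormal, fluctuation_abnormal, threshold_abnormal):
    """Combine three detector verdicts (True means abnormal) for one platform index."""
    votes = int(deviation_abnormal) + int(fluctuation_abnormal)
    if votes == 2:
        return True   # both methods agree: root cause index
    if votes == 0:
        return False  # both methods agree: normal index
    # Exactly one method flagged the index: absolute threshold is the tie-breaker.
    return bool(threshold_abnormal)


# Usage with made-up verdicts for the platform indexes named in the text.
verdicts = {
    "cpu_utilization": (True, True, False),
    "memory_utilization": (True, False, True),
    "db_connections": (False, True, False),
    "net_queue_length": (False, False, True),
}
root_cause_indexes = [name for name, v in verdicts.items() if is_root_cause_index(*v)]
```

Using the absolute threshold only as a tie-breaker means an index that merely exceeds a static limit, without any deviation or fluctuation signal, is not reported.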
In a second aspect, the present application provides a microservice application system root cause location apparatus, comprising:
the anomaly detection module is used for carrying out anomaly detection on the micro-service application system to obtain an abnormal micro-service and an abnormal moment corresponding to the abnormal micro-service;
and the root cause node positioning module is used for detecting the calling chain of the abnormal micro-service and the corresponding micro-service node, and carrying out fault positioning according to the abnormal time and the calling logic of the calling chain to obtain the micro-service root cause node.
In an embodiment of the application, the microservice application system root cause positioning apparatus further includes:
and the root cause index determining module is used for acquiring the platform index of the corresponding micro service according to the micro service root cause node and the abnormal time to perform abnormal detection analysis and determine the root cause index.
In an embodiment of the present application, the abnormality detecting module includes:
the anomaly detection unit is used for carrying out gold index anomaly detection on the micro-service application system by adopting an absolute threshold detection method to obtain a fault type;
and the abnormity determining unit is used for determining the abnormal micro-service and the abnormal time corresponding to the abnormal micro-service according to the fault type.
In an embodiment of the present application, the anomaly detection unit is specifically configured to:
detect the average delay and the success rate of the micro-service requests of the micro-service application system;
if the average delay is not less than a preset average delay threshold and the success rate is not less than a preset success rate threshold, the fault type is a time-consuming anomaly;
if the average delay is less than the average delay threshold and the success rate is less than the success rate threshold but not zero, the fault type is a success rate anomaly;
if the average delay is less than the average delay threshold and the success rate is zero, the fault type is a database anomaly;
if the average delay is not less than the average delay threshold and the success rate is less than the success rate threshold but not zero, the fault type is a simultaneous time-consuming and success rate anomaly;
and if the average delay is not less than the average delay threshold and the success rate is zero, the fault type is a simultaneous time-consuming and database anomaly.
In an embodiment of the application, the root cause node locating module includes:
a screening unit, configured to screen the call chains of the abnormal micro-service to obtain abnormal call chains;
a search unit, configured to perform anomaly analysis on the nodes of the abnormal call chains with a search algorithm, according to the calling logic of the abnormal call chains, to obtain abnormal nodes and undetermined nodes, where the abnormal nodes and the undetermined nodes are the alternative root cause nodes of the micro-service;
and a node positioning unit, configured to perform location analysis on the alternative root cause nodes with a call chain positioning algorithm to obtain the micro-service root cause node.
In an embodiment of the present application, the screening unit includes:
a statistics subunit, configured to compile, with a normal distribution, statistics on the time consumed by the call chain in each micro-service request of the abnormal micro-service;
a screening subunit, configured to screen the abnormal time-consuming data out of the call chain time consumption statistics according to the abnormal time;
and a call chain determining subunit, configured to determine the call chains corresponding to the abnormal time-consuming data as the abnormal call chains.
In an embodiment of the present application, the screening subunit is specifically configured to:
take the data in the call chain time consumption statistics that fall within a preset time period before the abnormal time as the data to be screened;
and screen the data to be screened against the confidence interval of the normal distribution statistics at a preset confidence level to obtain the abnormal time-consuming data.
In an embodiment of the present application, the search unit is specifically configured to:
compile, with a normal distribution, statistics on the call time consumption of the nodes of the abnormal call chains;
and perform a layer-by-layer drill-down search on the nodes of the abnormal call chains with a pruning algorithm, according to the calling logic of the abnormal call chains and the call time consumption of the nodes, to obtain abnormal nodes and undetermined nodes, where the abnormal nodes and the undetermined nodes are the alternative root cause nodes of the micro-service.
In an embodiment of the present application, the node positioning unit is specifically configured to:
counting the failure times and failure rate of the alternative root cause nodes in the multiple calling processes;
and analyzing the alternative root cause nodes layer by layer according to the depth, the failure times and the failure rate of the alternative root cause nodes to determine the root cause nodes.
In an embodiment of the application, the root cause indicator determining module includes:
a platform index detection unit, configured to detect, within a preset time, the platform indexes of the component corresponding to the micro-service root cause node;
and a root cause index determining unit, configured to perform anomaly detection analysis on the platform indexes by combining multiple anomaly detection methods, according to the platform indexes and the abnormal time, to obtain the root cause index.
In an embodiment of the present application, the root cause index determining unit is specifically configured to:
if a platform index is checked with both a deviation analysis method and a fluctuation point analysis method and both results are abnormal, take the platform index as a root cause index;
if a platform index is checked with both a deviation analysis method and a fluctuation point analysis method and only one result is abnormal, and a further check with an absolute threshold method is abnormal, take the platform index as a root cause index;
if a platform index is checked with both a deviation analysis method and a fluctuation point analysis method and both results are normal, take the platform index as a normal index;
and if a platform index is checked with both a deviation analysis method and a fluctuation point analysis method and only one result is abnormal, and a further check with an absolute threshold method is normal, take the platform index as a normal index.
In an embodiment of the application, the platform index includes one or more of a CPU utilization rate, a memory utilization rate, a database connection number, and a network transceiving queue length.
In a third aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements a microservice application system root cause localization method as described in any embodiment of the first aspect of the present application.
In a fourth aspect, the present application provides an electronic device comprising a processor, a memory, a communication interface and a communication bus, where the processor, the memory and the communication interface communicate with one another through the communication bus;
the memory is used for storing at least one executable instruction, and the executable instruction causes the processor to perform the operations corresponding to the micro-service application system root cause positioning method described in any embodiment of the first aspect of the application.
The technical scheme provided by the embodiment of the application has the following beneficial effects:
The micro-service application system root cause positioning method performs multi-dimensional anomaly detection on the micro-service application system. It first performs fault discovery on the micro-service system to obtain the abnormal micro-services and abnormal times, and then further examines the call chains and nodes of the abnormal micro-services to obtain the root cause node. This decouples the analysis of micro-service operation and maintenance scenarios, makes it easier to determine the actual business of each scenario, and reduces the complexity of the algorithm for each scenario. It thereby addresses the heavy labor cost required in the prior art and supports the efficiency and stability of the root cause positioning method.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
Fig. 1 is a schematic flowchart of a method for locating a root cause of a micro service application system according to an embodiment of the present application;
Fig. 2 is a schematic diagram of micro-service node calling relationships in a micro-service application system root cause positioning method according to an embodiment of the present application;
Fig. 3 is a schematic diagram of a complete call chain of a micro-service in a micro-service application system root cause positioning method according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of a micro-service application system root cause positioning device according to an embodiment of the present application;
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes any one of, and all combinations of, one or more of the associated listed items.
To make the objects, technical solutions and advantages of the present application clearer, the technical solutions are described in detail below with specific examples. The specific embodiments below may be combined with one another, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application are described below with reference to the accompanying drawings.
As the call hierarchy between business systems grows deeper and the calling relationships grow more complex, the number of micro-services can reach the thousands, and they may be deployed across thousands of different servers. The micro-services run on and call a large number of components, and each component has a different number of monitoring indexes. In such a complicated environment it is difficult to troubleshoot and locate the specific components and indexes that cause a micro-service anomaly, so the root cause of a problem cannot be located and resolved quickly.
Existing schemes for locating the root cause of a micro-service fault fall mainly into the following categories:
1. Positioning by expert experience: after a fault occurs, the head-end micro-service called by the current business is determined manually, the calling relationships and abnormal states of the nodes are then analyzed layer by layer based on call-link tracing information until the root cause micro-service node is found, and the likely platform index root cause is then inferred from the existing topology and expert experience of the micro-service.
However, because there are thousands of micro-services and their calling relationships are intricate, manual troubleshooting based on expert prior knowledge is too labor-intensive, can locate only some faults, and cannot adapt to frequent changes in the micro-service calling relationships.
2. Random forest algorithm: effective indexes are selected manually from a number of data samples, the abnormal root cause node corresponding to each data sample is labeled, several random decision trees are trained on the manually labeled data, the positioning results of the decision trees are put to a vote, and the abnormal root cause node is located according to the voting result.
However, with manual labeling, reasonably accurate abnormal samples from a real environment are rare and can hardly cover the variety of abnormal situations, and the labeling workload is very large and takes too long.
3. Random walk algorithm: multi-index data is collected, an influence graph is constructed based on abnormal features and index data weights, and the root cause is calculated by a random walk algorithm with forward and backward self-adjustment.
Because the random walk algorithm is based on index weights, the range of indexes that influence micro-service operation must still be determined from manual experience.
In view of this, the embodiment of the present application provides a method for locating a root cause of a micro-service application system. The method first establishes a business gold index anomaly detection model to actively discover business anomalies. It then performs a combined horizontal and vertical analysis of call chain tracing information, based on big data statistics, to locate the root cause node of the failed micro-service. Finally, it locates the true root cause of the micro-service anomaly with a platform index anomaly detection and root cause positioning model combined with the static topology. Business anomalies are thus detected actively and the true root cause is located quickly and accurately, without manual labeling or expert prior knowledge, which addresses the heavy workload, limited applicability and long duration of manual analysis or manual labeling.
The method provided by the embodiment of the application can comprise the following steps:
firstly, golden index anomaly detection: detecting anomalies of the golden indexes in an ESB (Enterprise Service Bus) service by an absolute threshold detection method, and returning the abnormal time and the name of the abnormal micro-service;
secondly, call chain root cause positioning: according to the golden index detection result of the first step, further detecting anomalies in all call chains of the specific micro-service within a preset time period before the abnormal time, locating the abnormal nodes according to the call chain relation, and returning the abnormal time, the name of the abnormal micro-service and the micro-service root cause node;
thirdly, platform index anomaly detection: based on the micro-service abnormal node located in the second step at a given abnormal time, associating the topological relation of the platform architecture, further detecting anomalies of each index on the platform side, and finally finding the root cause index.
In order to better understand the micro service application system root cause positioning method provided in the embodiments of the present application, the method is described in detail below with reference to fig. 1 to 5.
In an embodiment of the present application, a method 10 for root cause positioning of a micro service application system is provided, as shown in fig. 1, the method may include the following steps:
S1, performing anomaly detection on the micro-service application system to obtain an abnormal micro-service and the abnormal time corresponding to the abnormal micro-service.
S2, detecting the call chains of the abnormal micro-service and the corresponding micro-service nodes, and performing fault location according to the abnormal time and the calling logic of the call chains to obtain the micro-service root cause node.
Under the micro-service architecture, a large, complex all-in-one application can be split into a plurality of micro-services that are highly cohesive, loosely coupled and stateless. Fig. 2 is a schematic diagram of the call relations among micro-service nodes: each micro-service is deployed as a plurality of different micro-service nodes, and the call relations among the micro-service nodes form call links. In fig. 2, the entire micro-service system is divided into three broad classes of micro-services, A, B and C, and each class includes several micro-service nodes; for example, class A includes nodes A1, A2, A3 and so on. Each micro-service node may be a component of the micro-service system, such as a terminal, a technical component (message middleware, network server) or a database. When the system issues a service request, call links are formed, such as A2-A3, A3-B1 and B1-C1.
The micro-service application system root cause positioning method performs multi-dimensional anomaly detection on the micro-service application system. Fault discovery is first performed on the micro-service system to obtain the abnormal micro-service and the abnormal time, and the call chains and nodes of the abnormal micro-service are then examined to obtain the root cause node. This decouples the analysis of the micro-service operation and maintenance scenarios, which helps clarify the actual services of each scenario and reduces the complexity of the algorithm for each scenario. It solves the technical problem of the prior art that a large amount of labor is required, and ensures the efficiency and stability of the root cause positioning method.
A possible implementation manner is provided in the embodiment of the present application, and step S1 may specifically include:
and S11, performing gold index abnormity detection on the micro-service application system by adopting an absolute threshold detection method to obtain a fault type.
And S12, determining the abnormal micro-service and the abnormal time corresponding to the abnormal micro-service according to the fault type.
Golden index detection examines the key indexes of a micro-service. A longer service response time indicates degraded processing performance of the system, while a sudden drop in the request success rate (equivalently, a sudden rise in the error rate) may indicate a problem in the system's own processing logic. The various service indexes of the micro-services are therefore grouped into two categories, response time and request success rate, for anomaly detection, which improves the efficiency and effect of micro-service fault detection.
Another possible implementation manner is provided in the embodiment of the present application, and step S11 may specifically include:
detecting the average delay and success rate of the micro-service request of the micro-service application system;
if the average delay is not less than a preset average delay threshold and the success rate is not less than a preset success rate threshold, the fault type is a time-consumption anomaly;
if the average delay is less than the average delay threshold, and the success rate is less than the success rate threshold and is not zero, the fault type is a success rate anomaly;
if the average delay is less than the average delay threshold and the success rate is zero, the fault type is a database anomaly;
if the average delay is not less than the average delay threshold, and the success rate is less than the success rate threshold and is not zero, the fault type is a simultaneous time-consumption and success rate anomaly;
and if the average delay is not less than the average delay threshold and the success rate is zero, the fault type is a simultaneous time-consumption and database anomaly.
In practical application, the average delay threshold and the success rate threshold can be set based on the anomaly proportion observed in practice; for example, the average delay threshold is 1 s and the success rate threshold is 90%. In the embodiment of the application, golden index anomaly detection is performed on the micro-service application system by combining an absolute threshold detection method with a classification approach, and the fault types are finely classified. This solves the technical problem of the prior art that the various abnormal situations are difficult to cover, and achieves full coverage of the different fault types in actual services.
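The classification in step S11 can be illustrated with a minimal Python sketch. The function name, the string labels and the default thresholds (1 s and 90%, taken from the example values above) are assumptions for illustration only; the label for the zero-success-rate case is taken to be a database anomaly, by analogy with the combined time-consumption and database case.

```python
def classify_fault(avg_delay_s, success_rate,
                   delay_threshold_s=1.0, success_threshold=0.90):
    """Map the two golden indexes to a fault type, or None if both are normal.

    Hypothetical helper: names, labels and defaults are illustrative.
    """
    slow = avg_delay_s >= delay_threshold_s  # "not less than" the delay threshold
    if success_rate == 0:
        return "time_and_database" if slow else "database"
    if success_rate < success_threshold:
        return "time_and_success_rate" if slow else "success_rate"
    return "time" if slow else None


if __name__ == "__main__":
    print(classify_fault(1.5, 0.95))
    print(classify_fault(0.2, 0.0))
```

Checking the zero-success-rate case first keeps the five branches mutually exclusive, so each pair of index values maps to at most one fault type.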
Another possible implementation manner is provided in the embodiment of the present application, and step S2 may specifically include:
S21, screening the call chains of the abnormal micro-service to obtain abnormal call chains;
specifically, in this embodiment, a complete call chain of a micro-service request propagates in one direction along its links, and each call chain has a unique identifier that distinguishes it from other call chains.
S22, according to the calling logic of the abnormal calling chain, performing abnormal analysis on the nodes of the abnormal calling chain by adopting a search algorithm to obtain abnormal nodes and undetermined nodes, wherein the abnormal nodes and the undetermined nodes are micro-service alternative root nodes;
and S23, positioning and analyzing the alternative root cause node by adopting a calling chain positioning algorithm to obtain the micro-service root cause node.
Another possible implementation manner is provided in the embodiment of the present application, and step S21 may specifically include:
S211, computing normal-distribution statistics of the call chain time consumption of each micro-service request of the abnormal micro-service; the call chain time consumption is the sum of the time consumption of all micro-service nodes contained in the call chain.
S212, screening abnormal time consumption data in the time consumption statistics of the calling chain according to the abnormal time.
And S213, determining the call chain corresponding to the abnormal time-consuming data as the abnormal call chain.
Another possible implementation manner is provided in this embodiment of the present application, and step S212 specifically includes:
acquiring data belonging to a preset time period before an abnormal moment in calling chain time consumption statistics as data to be screened;
and screening out data in a confidence interval under a preset confidence level in normal distribution statistics in the data to be screened as abnormal time-consuming data. In order to improve the efficiency of data analysis, data within 5 minutes before the abnormal time can be obtained, and then data belonging to a 90% confidence interval in the data can be screened out.
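Steps S211 to S213 can be sketched as follows in Python. This is only an illustration under stated assumptions: the function name, the tuple layout of the input and the use of a two-sided interval are hypothetical, and the sketch simply splits the recent call chains into those inside and those outside the interval of a fitted normal distribution, leaving it to the caller to decide which group is treated as the abnormal time-consumption data.

```python
from statistics import NormalDist, mean, pstdev


def split_by_confidence_interval(chains, abnormal_time,
                                 window_s=300.0, confidence=0.90):
    """Split recent call chains by a normal-distribution confidence interval.

    chains: iterable of (chain_id, timestamp_s, [node durations in s]).
    Returns (inside_ids, outside_ids). Hypothetical helper for illustration.
    """
    # Chain time consumption is the sum of its node durations; keep only
    # chains observed in the window (default 5 minutes) before the anomaly.
    recent = [(cid, sum(nodes)) for cid, ts, nodes in chains
              if abnormal_time - window_s <= ts <= abnormal_time]
    durations = [d for _, d in recent]
    if len(durations) < 2 or pstdev(durations) == 0:
        return [cid for cid, _ in recent], []
    fit = NormalDist(mean(durations), pstdev(durations))
    tail = (1.0 - confidence) / 2.0
    low, high = fit.inv_cdf(tail), fit.inv_cdf(1.0 - tail)
    inside = [cid for cid, d in recent if low <= d <= high]
    outside = [cid for cid, d in recent if not low <= d <= high]
    return inside, outside
```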
In another possible implementation manner provided in the embodiment of the present application, step S22 specifically includes:
counting the calling time consumption of the nodes of the abnormal calling chain by adopting normal distribution;
and performing a layer-by-layer downward search on the nodes of the abnormal call chain by adopting a pruning algorithm, according to the calling logic of the abnormal call chain and the calling time consumption of the nodes, to obtain abnormal nodes and undetermined nodes, wherein the abnormal nodes and the undetermined nodes are micro-service alternative root cause nodes. Specifically, the node time consumption data for the 5 minutes before the abnormal time can be taken from the normal-distribution statistics of node calling time consumption. If the calling time consumption of a node is greater than 45% of the time consumption of its parent node, the node is considered to meet the time-consumption anomaly condition; otherwise, the node is considered not to meet it. The abnormal nodes and undetermined nodes are then determined as follows:
if the node meets the time-consumption anomaly condition or the call to the node fails, the node is an abnormal node;
and if the node does not meet the time-consumption anomaly condition and the call to the node succeeds, but the time consumption of its parent node is more than twice the sum of the time consumption of all nodes on the node's layer, the node is an undetermined node. In general, the search stops when a node is determined to be an undetermined node or has no child nodes.
The following describes in detail the process of performing layer-by-layer search on the nodes of the abnormal call chain by using a pruning algorithm, with reference to the micro-service complete call chain sample shown in fig. 3:
recording node information {'A_001': [1, 1, 0]} (format: node name: [abnormal times, depth, status], where status 0 means abnormal and 1 means undetermined);
continuing the downward search to node B_001: the B_001 node meets the time-consumption anomaly condition, so the node information is recorded as {'A_001': [1, 1, 0], 'B_001': [1, 2, 0]};
since B_001 is an abnormal node, continuing the downward search to node B_003: the B_003 node does not meet the time-consumption anomaly condition but its call failed, so the node information is recorded as {'A_001': [1, 1, 0], 'B_001': [1, 2, 0], 'B_003': [1, 3, 0]};
since B_003 is an abnormal node, continuing the downward search to nodes C_015 and C_016: the C_015 node does not meet the time-consumption anomaly condition but its call failed, so the node information is recorded as {'A_001': [1, 1, 0], 'B_001': [1, 2, 0], 'B_003': [1, 3, 0], 'C_015': [1, 4, 0]};
the C_016 node does not meet the time-consumption anomaly condition and its call succeeded, so C_016 is not an abnormal node; however, B_003 takes 20 s while its child calls (self-call, C_015 and C_016) take 4 s, 3 s and 2 s respectively, so the time consumption of B_003 is more than twice the sum of its child calls (9 s) and C_016 is an undetermined node; the node information is recorded as {'A_001': [1, 1, 0], 'B_001': [1, 2, 0], 'B_003': [1, 3, 0], 'C_015': [1, 4, 0], 'C_016': [1, 4, 1]};
since C_015 is an abnormal node, continuing the downward search to node D_003: the D_003 node does not meet the time-consumption anomaly condition and its call succeeded, so D_003 is not an abnormal node; however, C_015 takes 3 s while its child node D_003 takes 1 s, so the time consumption of C_015 is more than twice that of its child call and D_003 is an undetermined node; the node information is recorded as {'A_001': [1, 1, 0], 'B_001': [1, 2, 0], 'B_003': [1, 3, 0], 'C_015': [1, 4, 0], 'C_016': [1, 4, 1], 'D_003': [1, 5, 1]};
since C_016 is an undetermined node and D_003 has no child nodes, the search stops.
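The walk-through above can be reproduced with a short Python sketch of the pruning search. The `Span` class, the `self_time` field and the function name are hypothetical, and the durations of A_001 and B_001 (60 s and 50 s) are invented so that the stated conditions hold; only the values for B_003, C_015, C_016 and D_003 come from the example. The top-level node is recorded as abnormal because the chain has already been flagged as abnormal, which is also an assumption.

```python
from dataclasses import dataclass, field

ABNORMAL, UNDETERMINED = 0, 1


@dataclass
class Span:
    name: str
    duration: float            # seconds, including child calls
    success: bool = True
    self_time: float = 0.0     # the node's own "self-call" time, if reported
    children: list = field(default_factory=list)


def search_chain(root):
    """Layer-by-layer search; returns {name: [abnormal times, depth, status]}."""
    found = {root.name: [1, 1, ABNORMAL]}
    stack = [(root, 1)]
    while stack:
        parent, depth = stack.pop()
        layer_total = parent.self_time + sum(c.duration for c in parent.children)
        for child in parent.children:
            slow = child.duration > 0.45 * parent.duration
            if slow or not child.success:
                found[child.name] = [1, depth + 1, ABNORMAL]
                stack.append((child, depth + 1))      # keep descending
            elif parent.duration > 2 * layer_total:
                found[child.name] = [1, depth + 1, UNDETERMINED]  # stop here
            # otherwise the branch is pruned and not recorded
    return found


fig3 = Span("A_001", 60.0, children=[
    Span("B_001", 50.0, children=[
        Span("B_003", 20.0, success=False, self_time=4.0, children=[
            Span("C_015", 3.0, success=False, children=[Span("D_003", 1.0)]),
            Span("C_016", 2.0),
        ]),
    ]),
])

if __name__ == "__main__":
    print(search_chain(fig3))
```

Only abnormal nodes are pushed back onto the stack, so subtrees below undetermined or normal nodes are never visited; this is where the pruning saves work.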
The calling time consumption and whether the calling is successful can reflect node fault information, and if the bottom-layer node consumes more time or fails in calling, the time consumed by the upper-layer node is correspondingly increased or fails in calling. The root node is positioned based on the calling logic of the micro-service calling chain, so that the efficiency of positioning analysis of the micro-service node can be improved.
Another possible implementation manner is provided in the embodiment of the present application, and step S23 specifically includes:
counting the failure times and failure rate of the alternative root cause nodes in the multiple calling processes;
and analyzing the alternative root cause nodes layer by layer according to the depth, the failure times and the failure rate of the alternative root cause nodes to determine the root cause nodes.
According to the calling logic of the call chain, a root cause propagates from the bottom up, so the deeper a node lies in the chain, the more likely it is to be the root cause node. The distribution of abnormal nodes can be obtained from the statistics of the micro-service alternative root cause nodes. The specific call chain positioning algorithm is as follows:
according to the node time consumption data for the 5 minutes before the abnormal time, taken from the normal-distribution statistics of node calling time consumption, counting the total number of call chains, the call depth of each node, the number of calls of each node and its number of anomalies, and starting the analysis from the deepest alternative root cause node:
if the number of times the alternative root cause node is abnormal satisfies

$$m_{\text{abnormal}} \ge 0.8^{(\mathrm{deep}-2)} \times 0.05 \times n_{\text{chain}}$$

and the abnormal rate satisfies

$$p_{\text{abnormal}} = m_{\text{abnormal}} \div n_{\text{node}} \ge 50\%,$$

the node is determined as a root cause node to be determined.
If the number of times the alternative root cause node is undetermined satisfies

$$l_{\text{pending}} \ge 0.8^{(\mathrm{deep}-2)} \times 0.1 \times n_{\text{chain}}$$

and the undetermined rate satisfies

$$p_{\text{pending}} = l_{\text{pending}} \div n_{\text{node}} > 50\%,$$

the node is determined as a possible root cause node.
Here $m_{\text{abnormal}}$ is the number of times the node is determined to be an abnormal node, $l_{\text{pending}}$ is the number of times the node is determined to be an undetermined node, $\mathrm{deep}$ is the hierarchy depth of the node, $n_{\text{chain}}$ is the total number of call chains, and $n_{\text{node}}$ is the number of calls of the node.
If a single root cause node to be determined exists at a given depth, that node is directly determined as the micro-service root cause node; if several root cause nodes to be determined exist at a given depth, the node with the larger number of anomalies is determined as the micro-service root cause node; if several root cause nodes to be determined exist at that depth with the same number of anomalies, the node with the higher abnormal rate is returned as the micro-service root cause node; and if both the number of anomalies and the abnormal rate are equal, the docker node is returned as the micro-service root cause node.
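The selection rule for abnormal nodes can be sketched in Python as below. This is an illustration under assumptions: the factor 0.8 is read as being raised to the power (deep - 2), the function and parameter names are hypothetical, and the final fallback to the docker node is not modelled (the sketch returns None when both the count and the rate tie).

```python
def abnormal_threshold(depth, n_chains, base=0.05):
    # Assumed reading of the formula: 0.8 ** (deep - 2) * 0.05 * n_chain
    return 0.8 ** (depth - 2) * base * n_chains


def qualifies(abnormal_times, calls, depth, n_chains):
    if calls == 0:
        return False
    return (abnormal_times >= abnormal_threshold(depth, n_chains)
            and abnormal_times / calls >= 0.5)


def pick_root_cause(stats, n_chains):
    """stats: {node name: (depth, abnormal times, number of calls)}."""
    ok = {name: s for name, s in stats.items()
          if qualifies(s[1], s[2], s[0], n_chains)}
    if not ok:
        return None
    deepest = max(s[0] for s in ok.values())  # analysis starts from the bottom
    layer = sorted(((s[1], s[1] / s[2], name)
                    for name, s in ok.items() if s[0] == deepest),
                   reverse=True)              # more anomalies, then higher rate
    if len(layer) > 1 and layer[0][:2] == layer[1][:2]:
        return None  # full tie: the text falls back to the docker node
    return layer[0][2]


if __name__ == "__main__":
    demo = {"B_003": (3, 12, 20), "C_015": (4, 6, 10),
            "C_016": (4, 6, 8), "D_003": (5, 1, 10)}
    print(pick_root_cause(demo, n_chains=100))
```

With 100 call chains the threshold at depth 4 is about 3.2 anomalies, so both C_015 and C_016 qualify with 6 anomalies each, and the higher abnormal rate (6/8 versus 6/10) decides between them.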
In a root cause node positioning scene, the calling logic of the calling chain is effectively utilized to traverse and analyze the state of the micro service nodes of the abnormal calling chain, a pruning strategy is adopted to accelerate the searching speed, and then the calling chain positioning algorithm is adopted to determine the root cause nodes through the failure times and the failure rate, so that the technical problems of long time consumption and low efficiency of failure root cause positioning in the prior art are solved, and the technical effect of quickly and accurately positioning the root cause nodes of the micro service calling chain is realized.
Another possible implementation manner is provided in the embodiment of the present application, and after step S2, the method 10 may further include the following steps:
and S3, acquiring platform indexes of corresponding micro services according to the micro service root cause nodes and abnormal time for carrying out abnormal detection analysis, and determining the root cause indexes.
In this embodiment, the platform metrics may include: CPU utilization rate, memory utilization rate, database connection number and network transceiving queue length.
Another possible implementation manner is provided in the embodiment of the present application, and step S3 specifically includes:
s31, detecting platform indexes of components corresponding to the micro-service root nodes within preset time;
and S32, performing anomaly detection analysis on the platform index by adopting a mode of combining multiple anomaly detection methods according to the platform index and the anomaly time to obtain a root cause index.
Another possible implementation manner is provided in the embodiment of the present application, and step S32 may specifically include the following steps:
if the platform index is detected by a deviation analysis method and a fluctuation point analysis method at the same time, and both detection results are abnormal, the platform index is taken as a root cause index;
if the platform index is detected by the deviation analysis method and the fluctuation point analysis method at the same time and exactly one of the detection results is abnormal, the platform index is further detected by an absolute threshold method, and if that detection result is abnormal, the platform index is taken as a root cause index;
if the platform index is detected by the deviation analysis method and the fluctuation point analysis method at the same time, and both detection results are normal, the platform index is taken as a normal index;
and if the platform index is detected by the deviation analysis method and the fluctuation point analysis method at the same time and exactly one of the detection results is abnormal, and the detection result obtained by the absolute threshold method is normal, the platform index is taken as a normal index.
In actual operation, three statistical methods of deviation analysis, fluctuation point analysis and absolute threshold method can be packaged and integrated into a function, the input of function call is the detection time and the observed index column, the detection time is usually the abnormal time, the observed index column is all platform index data acquired within the preset time, and the output of function call is whether the index is abnormal within 5 minutes before the abnormal time.
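The way the three detection results are combined can be sketched as a small Python function. It assumes the reading in which the absolute threshold method acts as a tie-breaker when the deviation analysis and the fluctuation point analysis disagree; the function name and boolean inputs are hypothetical.

```python
def is_root_cause_index(deviation_abnormal, fluctuation_abnormal,
                        threshold_abnormal):
    """Combine three detector verdicts into one decision (illustrative)."""
    votes = int(deviation_abnormal) + int(fluctuation_abnormal)
    if votes == 2:       # both statistical detectors flag the index
        return True
    if votes == 0:       # neither flags it
        return False
    return bool(threshold_abnormal)  # one flags it: absolute threshold decides


if __name__ == "__main__":
    print(is_root_cause_index(True, False, True))
```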
Specifically, the process of detecting the platform index by the deviation analysis method is as follows: the mean and the maximum of the data column are obtained from all the input platform index data, a deviation threshold is set according to the input data, and the deviation of the index data at a given moment is calculated as

$$Disp_i = (series_i - mean_{series}) \div max_{series}$$

If $Disp_i > Thres_{disp}$, the index data is abnormal at that moment;
if $Disp_i \le Thres_{disp}$, the index data is normal at that moment;
wherein $Disp_i$ is the deviation of the index data at that moment, $series_i$ is the index data at that moment, $mean_{series}$ is the mean of the data column, $max_{series}$ is the maximum of the data column, and $Thres_{disp}$ is the deviation threshold.
The fluctuation point analysis method mainly uses sliding window statistics to capture how the index data fluctuates over time. The process of detecting the platform index by the fluctuation point analysis method is as follows: all the input platform index data is sliced in time with a window length of 3 minutes, the data of each time window slice is modeled by a normal distribution, and the mean and the variance of the slice data in each time window are computed:

if $|newseries_i - \mu_{newseries}| \ge 3 \times \sigma_{newseries}$, the index data is abnormal at that moment;
if $|newseries_i - \mu_{newseries}| < 3 \times \sigma_{newseries}$, the index data is normal at that moment;
wherein $newseries_i$ is the index data at that moment within a time window slice of the input index column, $\mu_{newseries}$ is the mean of the time window slice sequence, and $\sigma_{newseries}$ is the variance of the time window slice sequence.
The process of detecting the platform index by the absolute threshold method is as follows: the input platform index data is first sorted in ascending order, and the value in the middle position is taken as the absolute threshold;

if $series_i \ge median_{normalseries}$, the index data is abnormal at that moment;
if $series_i < median_{normalseries}$, the index data is normal at that moment;
wherein $series_i$ is the index data at that moment, and $median_{normalseries}$ is the value in the middle position after the input platform index data is sorted in ascending order.
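The three detection methods can be sketched in Python with the standard `statistics` module. The function names and the threshold value used in the example are hypothetical. Two assumptions are made: the spread in the fluctuation test is taken as the standard deviation, as in the usual three-sigma rule (the text above calls it the variance), and the middle value is computed with `statistics.median`, which averages the two middle values for an even-length column.

```python
from statistics import mean, median, pstdev


def deviation_abnormal(series, i, threshold):
    # Disp_i = (series_i - mean) / max, abnormal when above the threshold
    peak = max(series)
    if peak == 0:
        return False
    return (series[i] - mean(series)) / peak > threshold


def fluctuation_abnormal(window, i):
    # Three-sigma test inside one time-window slice (for example 3 minutes)
    mu, sigma = mean(window), pstdev(window)
    return sigma > 0 and abs(window[i] - mu) >= 3 * sigma


def absolute_threshold_abnormal(series, i):
    # The middle value of the sorted column serves as the absolute threshold
    return series[i] >= median(series)


if __name__ == "__main__":
    cpu = [10.0] * 19 + [100.0]   # hypothetical CPU utilization samples
    print(deviation_abnormal(cpu, 19, threshold=0.5),
          fluctuation_abnormal(cpu, 19),
          absolute_threshold_abnormal(cpu, 19))
```

Each function returns a boolean for a single sample, so the three results can be fed directly into a combination rule such as the one described for step S32.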
In a root cause index detection scene, the idea of ensemble learning is adopted to combine various abnormal detection methods, the robustness of an index detection algorithm is enhanced, the accuracy of a root cause index confirmation method is improved, and the business experience is integrated into the index detection algorithm through an absolute threshold method, so that the practicability of the index detection algorithm is enhanced.
Based on the same inventive concept, the embodiment of the application also provides a micro-service application system root cause positioning device. As shown in fig. 4, the microservice application system root cause locator 20 may include: an anomaly detection module 201 and a root cause node location module 202, wherein,
an anomaly detection module 201, configured to perform anomaly detection on the microservice application system to obtain an abnormal microservice and an abnormal time corresponding to the abnormal microservice;
and the root cause node positioning module 202 is configured to detect the calling chain of the abnormal micro service and the corresponding micro service node, and perform fault positioning according to the abnormal time and the calling logic of the calling chain to obtain the micro service root cause node.
In this embodiment, another possible implementation manner is provided, and the microservice application system root cause positioning apparatus 20 further includes:
and the root cause index determining module is used for acquiring the platform index of the corresponding micro service according to the micro service root cause node and the abnormal time to perform abnormal detection analysis and determine the root cause index.
In an embodiment of the present application, another possible implementation manner is provided, and the abnormality detecting module 201 may include:
the anomaly detection unit is used for carrying out gold index anomaly detection on the micro-service application system by adopting an absolute threshold detection method to obtain a fault type;
and the abnormity determining unit is used for determining the abnormal micro-service and the abnormal time corresponding to the abnormal micro-service according to the fault type.
In an embodiment of the present application, another possible implementation manner is provided, and the abnormality detecting unit may be specifically configured to:
detecting the average delay and success rate of the micro-service request of the micro-service application system;
if the average delay is not less than a preset average delay threshold and the success rate is not less than a preset success rate threshold, the fault type is a time-consumption anomaly;
if the average delay is less than the average delay threshold, and the success rate is less than the success rate threshold and is not zero, the fault type is a success rate anomaly;
if the average delay is less than the average delay threshold and the success rate is zero, the fault type is a database anomaly;
if the average delay is not less than the average delay threshold, and the success rate is less than the success rate threshold and is not zero, the fault type is a simultaneous time-consumption and success rate anomaly;
and if the average delay is not less than the average delay threshold and the success rate is zero, the fault type is a simultaneous time-consumption and database anomaly.
In an embodiment of the present application, another possible implementation manner is provided, and the root node positioning module 202 may include:
the screening unit is used for screening the calling chain of the abnormal micro-service to obtain the abnormal calling chain;
the searching unit is used for performing anomaly analysis on the nodes of the abnormal calling chain by adopting a searching algorithm according to the calling logic of the abnormal calling chain to obtain abnormal nodes and undetermined nodes, wherein the abnormal nodes and the undetermined nodes are micro-service alternative root nodes;
and the node positioning unit is used for positioning and analyzing the alternative root cause node by adopting a calling chain positioning algorithm to obtain the micro-service root cause node.
Another possible implementation manner is provided in the embodiment of the present application, and the screening unit may include:
the statistical subunit is used for counting the time consumed by calling chains in each micro-service request of the abnormal micro-service by adopting normal distribution;
the screening subunit is used for screening abnormal time consumption data in the calling chain time consumption statistics according to the abnormal time;
and the calling chain determining subunit is used for determining the calling chain corresponding to the abnormal time-consuming data as the abnormal calling chain.
In an embodiment of the present application, another possible implementation manner is provided, and the screening subunit is specifically configured to:
acquiring data belonging to a preset time period before an abnormal moment in calling chain time consumption statistics as data to be screened;
and screening out data in a confidence interval under a preset confidence level in normal distribution statistics in the data to be screened as abnormal time-consuming data.
In an embodiment of the present application, another possible implementation manner is provided, and the search unit is specifically configured to:
counting the calling time consumption of the nodes of the abnormal calling chain by adopting normal distribution;
and performing layer-by-layer downward exploration search on the nodes of the abnormal calling chain by adopting a pruning algorithm according to the calling logic of the abnormal calling chain and the calling time consumption of the nodes to obtain abnormal nodes and undetermined nodes, wherein the abnormal nodes and the undetermined nodes are micro-service alternative root nodes.
In an embodiment of the present application, another possible implementation manner is provided, and the node positioning unit is specifically configured to:
counting the failure times and failure rate of the alternative root cause nodes in the multiple calling processes;
and analyzing the alternative root cause nodes layer by layer according to the depth, the failure times and the failure rate of the alternative root cause nodes to determine the root cause nodes.
In an embodiment of the present application, another possible implementation manner is provided, and the root cause indicator determining module may include:
the platform index detection unit is used for detecting the platform index of the component corresponding to the micro-service root node within preset time;
and the root cause index determining unit is used for performing anomaly detection analysis on the platform index by adopting a mode of combining multiple anomaly detection methods according to the platform index and the anomaly time to obtain the root cause index.
In an embodiment of the present application, another possible implementation manner is provided, and the root cause indicator determining unit is specifically configured to:
if the platform index is detected by a deviation analysis method and a fluctuation point analysis method at the same time, and both detection results are abnormal, taking the platform index as a root cause index;
if the platform index is detected by the deviation analysis method and the fluctuation point analysis method at the same time and exactly one of the detection results is abnormal, further detecting the platform index by an absolute threshold method, and if that detection result is abnormal, taking the platform index as a root cause index;
if the platform index is detected by the deviation analysis method and the fluctuation point analysis method at the same time, and both detection results are normal, taking the platform index as a normal index;
and if the platform index is detected by the deviation analysis method and the fluctuation point analysis method at the same time and exactly one of the detection results is abnormal, and the detection result obtained by the absolute threshold method is normal, taking the platform index as a normal index.
In another possible implementation manner provided in the embodiment of the present application, the platform index may include one or more of a CPU utilization rate, a memory utilization rate, a database connection number, and a network transceiving queue length.
The details of the root cause positioning device of the micro service application system provided in the embodiment of the present application may refer to the root cause positioning method of the micro service application system provided in the above embodiment, and the beneficial effects that the root cause positioning device of the micro service application system provided in the embodiment of the present application can achieve are the same as the root cause positioning method of the micro service application system provided in the above embodiment, and are not described again here.
The application of the embodiment of the application has at least the following beneficial effects:
according to the micro-service application system fault diagnosis method and system, multi-dimensional abnormity detection is carried out on the micro-service application system, fault discovery is carried out on the micro-service system to obtain abnormal micro-services and abnormal time, then a calling chain and a node of the abnormal micro-services are further detected to obtain a root cause node, decoupling analysis of micro-service operation and maintenance scenes is achieved, two independent modules are correspondingly formed, actual services of all scenes can be determined, complexity of algorithms of all scenes is reduced, the technical problem that a large amount of labor cost needs to be consumed in the prior art is solved, and high efficiency and stability of the root cause positioning device of the micro-service application system are guaranteed.
Based on the same inventive concept, the present application also provides a computer-readable storage medium on which a computer program is stored; when the computer program runs on a computer, the computer is enabled to execute the corresponding content of the foregoing method embodiments. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk drive (HDD), a solid state drive (SSD), or the like; the storage medium may also comprise a combination of memories of the kinds described above.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the related hardware. The program can be stored in a computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
Based on the same inventive concept, the embodiment of the present application further provides an electronic device 30, as shown in fig. 5, the electronic device includes a processor 301, a memory 302, and a computer program stored on the memory 302 and capable of running on the processor 301, and the steps of the method in the embodiment are implemented when the processor 301 executes the program.
Fig. 5 is a schematic diagram of a hardware structure of an electronic device for executing a method for locating a root cause of a microservice application system according to an embodiment of the present invention, as shown in fig. 5, the electronic device includes one or more processors 301 and a memory 302, where one processor 301 is taken as an example in fig. 5.
The electronic device executing the micro service application system root cause positioning method may further include: an input device 303 and an output device 304.
The processor 301, the memory 302, the input device 303 and the output device 304 may be connected by a bus or by other means; connection by a bus 305 is taken as an example in fig. 5.
Processor 301 may be a Central Processing Unit (CPU). The processor 301 may also be other general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, or combinations thereof.
It should be understood that the above examples are given only for clarity of illustration and are not intended to limit the embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to enumerate all embodiments here, and obvious variations or modifications derived therefrom remain within the scope of the invention.

Claims (16)

1. A micro-service application system root cause positioning method is characterized by comprising the following steps:
carrying out anomaly detection on the micro-service application system to obtain an abnormal micro-service and an abnormal moment corresponding to the abnormal micro-service;
and detecting the calling chain of the abnormal micro-service and the corresponding micro-service node, and carrying out fault positioning according to the abnormal moment and the calling logic of the calling chain to obtain a micro-service root cause node.
2. The method for root cause location of micro-service application system according to claim 1, wherein, after the detecting the calling chain of the abnormal micro-service and the corresponding micro-service node and performing fault location according to the abnormal time and the calling logic of the calling chain to obtain the micro-service root cause node, the method further comprises:
acquiring a platform index of the corresponding micro-service according to the micro-service root cause node and the abnormal time, performing anomaly detection analysis on the platform index, and determining the root cause index.
3. The method for positioning root cause of microservice application system according to claim 1 or 2, wherein the step of performing anomaly detection on the microservice application system to obtain an abnormal microservice and an abnormal time corresponding to the abnormal microservice comprises:
carrying out golden index anomaly detection on the micro-service application system by adopting an absolute threshold detection method to obtain a fault type;
and determining abnormal micro-services and abnormal moments corresponding to the abnormal micro-services according to the fault types.
4. The method for root cause location of micro-service application system according to claim 3, wherein the performing golden index anomaly detection on the micro-service application system by using an absolute threshold detection method to obtain a fault type comprises:
detecting the average delay and the success rate of the micro-service requests of the micro-service application system;
if the average delay is not less than a preset average delay threshold and the success rate is not less than a preset success rate threshold, determining that the fault type is a time-consuming anomaly;
if the average delay is less than the average delay threshold, and the success rate is less than the success rate threshold and is not zero, determining that the fault type is a success rate anomaly;
if the average delay is less than the average delay threshold and the success rate is zero, determining that the fault type is a database anomaly;
if the average delay is not less than the average delay threshold, and the success rate is less than the success rate threshold and is not zero, determining that the fault type is a simultaneous time-consuming anomaly and success rate anomaly;
and if the average delay is not less than the average delay threshold and the success rate is zero, determining that the fault type is a simultaneous time-consuming anomaly and database anomaly.
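The threshold rules of claim 4 can be sketched as a small decision function. This is a minimal illustration and not the claimed implementation: the names `FaultType` and `classify_fault` are hypothetical, the `NONE` outcome (both indexes within their thresholds) is an assumption added for completeness, and the success rate threshold is assumed to be greater than zero.

```python
from enum import Enum


class FaultType(Enum):
    # Hypothetical labels for the five fault types named in claim 4.
    NONE = "no fault"  # assumption: not listed in the claim
    TIME_CONSUMING = "time-consuming anomaly"
    SUCCESS_RATE = "success rate anomaly"
    DATABASE = "database anomaly"
    TIME_CONSUMING_AND_SUCCESS_RATE = "time-consuming and success rate anomaly"
    TIME_CONSUMING_AND_DATABASE = "time-consuming and database anomaly"


def classify_fault(avg_delay, success_rate, delay_threshold, success_threshold):
    """Map average delay and success rate to a fault type by absolute thresholds."""
    # "Not less than" the delay threshold counts as slow.
    slow = avg_delay >= delay_threshold
    if success_rate == 0:
        # A success rate of exactly zero is attributed to the database.
        return FaultType.TIME_CONSUMING_AND_DATABASE if slow else FaultType.DATABASE
    if success_rate < success_threshold:
        return (FaultType.TIME_CONSUMING_AND_SUCCESS_RATE if slow
                else FaultType.SUCCESS_RATE)
    return FaultType.TIME_CONSUMING if slow else FaultType.NONE
```

For example, `classify_fault(250, 0.80, 200, 0.95)` falls into the branch where the delay is at or above its threshold and the success rate is below its threshold but non-zero.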
5. The method for root cause location of micro-service application system according to any one of claims 1,2 or 4, wherein the detecting the calling chain of the abnormal micro-service and the corresponding micro-service node, and performing fault location according to the abnormal time and the calling logic of the calling chain to obtain the micro-service root cause node comprises:
screening the calling chain of the abnormal micro service to obtain an abnormal calling chain;
according to the calling logic of the abnormal calling chain, performing anomaly analysis on the nodes of the abnormal calling chain by adopting a search algorithm to obtain abnormal nodes and undetermined nodes, wherein the abnormal nodes and the undetermined nodes are micro-service alternative root cause nodes;
and positioning and analyzing the alternative root cause node by adopting a calling chain positioning algorithm to obtain the micro-service root cause node.
6. The method for root cause positioning of micro-service application systems according to claim 5, wherein the screening the call chain of the abnormal micro-service to obtain an abnormal call chain comprises:
compiling statistics, by adopting a normal distribution, on the calling chain time consumption of each micro-service request of the abnormal micro-service;
screening abnormal time-consuming data in the calling chain time-consuming statistics according to the abnormal time;
and determining the call chain corresponding to the abnormal time-consuming data as an abnormal call chain.
7. The micro-service application system root cause positioning method according to claim 6, wherein the screening out abnormal time-consuming data in the call chain time-consuming statistics according to the abnormal time comprises:
acquiring data belonging to a preset time period before the abnormal moment in the calling chain time consumption statistics as data to be screened;
and screening out data in a confidence interval under a preset confidence level in the normal distribution statistics in the data to be screened as abnormal time-consuming data.
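The screening of claims 6 and 7 can be illustrated with a short sketch. It is written under stated assumptions rather than taken from the application: the function name `screen_abnormal_chains`, the record layout `(trace_id, timestamp, duration)`, and the default confidence level of 0.95 are all hypothetical, and the sketch assumes that a duration above the upper bound of the two-sided confidence interval is what counts as abnormal time-consuming data.

```python
from statistics import NormalDist, mean, stdev


def screen_abnormal_chains(records, anomaly_time, window, confidence=0.95):
    """Return trace ids whose duration is abnormal shortly before anomaly_time.

    records: list of (trace_id, timestamp, duration) tuples (hypothetical layout);
    at least two records are needed to estimate the standard deviation.
    """
    durations = [duration for _, _, duration in records]
    # Fit a normal distribution to all observed call chain durations.
    mu, sigma = mean(durations), stdev(durations)
    # z value of the two-sided interval at the given confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    upper = mu + z * sigma
    # Keep only the data in the preset period before the abnormal moment.
    to_screen = [r for r in records if anomaly_time - window <= r[1] <= anomaly_time]
    # Assumption: durations above the upper bound are treated as abnormal.
    return [trace_id for trace_id, _, duration in to_screen if duration > upper]
```

The call chains returned by this function would then play the role of the abnormal call chains of claim 6.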
8. The method for positioning root cause of micro-service application system according to claim 5, wherein the step of performing anomaly analysis on the nodes of the abnormal call chain by using a search algorithm according to the call logic of the abnormal call chain to obtain abnormal nodes and undetermined nodes, wherein the abnormal nodes and the undetermined nodes are micro-service alternative root cause nodes, comprises the steps of:
compiling statistics, by adopting a normal distribution, on the calling time consumption of the nodes of the abnormal calling chain;
and performing a layer-by-layer drill-down search on the nodes of the abnormal calling chain by adopting a pruning algorithm according to the calling logic of the abnormal calling chain and the calling time consumption of the nodes to obtain abnormal nodes and undetermined nodes, wherein the abnormal nodes and the undetermined nodes are micro-service alternative root cause nodes.
9. The method for root cause location of micro-service application system according to claim 5, wherein the performing location analysis on the candidate root cause node by using a call chain location algorithm to obtain a micro-service root cause node comprises:
counting the number of failures and the failure rate of each alternative root cause node over multiple calls;
and analyzing the alternative root cause nodes layer by layer according to the depth, the number of failures and the failure rate of the alternative root cause nodes to determine the root cause node.
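Claim 9 names the inputs of the positioning step (depth, failure count, failure rate) but not the exact decision rule, so the following is only one possible reading. The sketch assumes a deepest-layer-first scan in which the first layer containing a sufficiently faulty node yields the root cause; the names `Candidate` and `locate_root_cause` and the thresholds `min_failures` and `min_rate` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    depth: int     # depth in the call chain, 0 = entry service
    failures: int  # number of calls in which the node was abnormal
    calls: int     # total number of observed calls

    @property
    def failure_rate(self):
        return self.failures / self.calls if self.calls else 0.0


def locate_root_cause(candidates, min_failures=3, min_rate=0.5):
    """Scan candidate layers from deepest to shallowest and pick a root cause."""
    for depth in sorted({c.depth for c in candidates}, reverse=True):
        # Assumption: a node qualifies only if it fails often enough.
        layer = [c for c in candidates
                 if c.depth == depth
                 and c.failures >= min_failures
                 and c.failure_rate >= min_rate]
        if layer:
            # Within a layer, prefer the highest failure rate, then failure count.
            return max(layer, key=lambda c: (c.failure_rate, c.failures)).name
    return None
```

Scanning from the deepest layer reflects the common intuition that a failing downstream node explains failures in its callers; other orderings are equally compatible with the claim wording.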
10. The method for root cause location of micro-service application system according to claim 2, wherein the obtaining platform indexes of corresponding micro-services according to the micro-service root cause nodes and the abnormal time to perform abnormal detection analysis and determine root cause indexes comprises:
detecting a platform index of a component corresponding to the micro-service root cause node within a preset time;
and performing anomaly detection analysis on the platform index by adopting a mode of combining multiple anomaly detection methods according to the platform index and the anomaly time to obtain a root cause index.
11. The method for root cause location of micro-service application system according to claim 10, wherein the performing anomaly detection analysis on the platform index by combining a plurality of anomaly detection methods according to the platform index and the anomaly time to obtain the root cause index comprises:
if the platform index is detected by adopting a deviation analysis method and a fluctuation point analysis method at the same time, and both detection results are abnormal, taking the platform index as a root cause index;
if the platform index is detected by adopting the deviation analysis method and the fluctuation point analysis method at the same time, and only one of the two detection results is abnormal, detecting the platform index by adopting an absolute threshold method, and if that detection result is abnormal, taking the platform index as a root cause index;
if the platform index is detected by adopting the deviation analysis method and the fluctuation point analysis method at the same time, and both detection results are normal, taking the platform index as a normal index;
and if the platform index is detected by adopting the deviation analysis method and the fluctuation point analysis method at the same time, and only one of the two detection results is abnormal, detecting the platform index by adopting the absolute threshold method, and if that detection result is normal, taking the platform index as a normal index.
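The combination of detectors in claim 11 can be sketched as a voting rule. This sketch assumes the four cases reduce to: both detectors abnormal gives a root cause index, both normal gives a normal index, and a split result is decided by the absolute threshold method. The function name `is_root_cause_index` is hypothetical, and the three detectors themselves are represented only by their boolean results.

```python
def is_root_cause_index(deviation_abnormal, fluctuation_abnormal, threshold_abnormal):
    """Combine three detector results for one platform index.

    Each argument is True if the corresponding method flagged the index:
    deviation analysis, fluctuation point analysis, absolute threshold.
    """
    votes = int(deviation_abnormal) + int(fluctuation_abnormal)
    if votes == 2:
        return True   # both methods agree on abnormal: root cause index
    if votes == 0:
        return False  # both methods agree on normal: normal index
    # Exactly one method flagged the index: the absolute threshold decides.
    return bool(threshold_abnormal)
```

Using the absolute threshold only as a tie-breaker means its result is ignored whenever the other two methods agree.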
12. The method of claim 11, wherein the platform metrics include one or more of CPU utilization, memory utilization, number of database connections, and network transceiving queue length.
13. A micro service application system root cause positioning device is characterized by comprising:
the anomaly detection module is used for carrying out anomaly detection on the micro-service application system to obtain an abnormal micro-service and an abnormal moment corresponding to the abnormal micro-service;
and the root cause node positioning module is used for detecting the calling chain of the abnormal micro service and the corresponding micro service node, and performing fault positioning according to the abnormal moment and the calling logic of the calling chain to obtain the micro service root cause node.
14. The micro-service application system root cause positioning device according to claim 13, wherein the micro-service application system root cause positioning device further comprises:
and the root cause index determining module is used for acquiring the platform index of the corresponding micro service according to the micro service root cause node and the abnormal time to perform abnormal detection analysis and determine the root cause index.
15. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program which, when executed by a processor, carries out the micro-service application system root cause positioning method according to any one of claims 1 to 12.
16. An electronic device, comprising: the system comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete mutual communication through the communication bus;
the memory is used for storing at least one executable instruction, and the executable instruction causes the processor to execute the operation corresponding to the micro service application system root cause positioning method according to any one of claims 1-12.
CN202011195383.4A 2020-10-30 2020-10-30 Micro-service application system root cause positioning method, device, medium and equipment Pending CN114528175A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011195383.4A CN114528175A (en) 2020-10-30 2020-10-30 Micro-service application system root cause positioning method, device, medium and equipment

Publications (1)

Publication Number Publication Date
CN114528175A true CN114528175A (en) 2022-05-24

Family

ID=81618508

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011195383.4A Pending CN114528175A (en) 2020-10-30 2020-10-30 Micro-service application system root cause positioning method, device, medium and equipment

Country Status (1)

Country Link
CN (1) CN114528175A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115118574A (en) * 2022-06-07 2022-09-27 马上消费金融股份有限公司 Data processing method, device and storage medium
CN115333921A (en) * 2022-08-20 2022-11-11 海南大学 Micro-service abnormal root cause positioning method and device
CN115333921B (en) * 2022-08-20 2024-03-29 海南大学 Micro-service abnormal root cause positioning method and device
CN115392812A (en) * 2022-10-31 2022-11-25 成都飞机工业(集团)有限责任公司 Abnormal root cause positioning method, device, equipment and medium
CN116170514A (en) * 2023-04-21 2023-05-26 华能信息技术有限公司 Service policy calling implementation method and system for middle-station business

Similar Documents

Publication Publication Date Title
CN110659173B (en) Operation and maintenance system and method
CN114528175A (en) Micro-service application system root cause positioning method, device, medium and equipment
US11442803B2 (en) Detecting and analyzing performance anomalies of client-server based applications
CN111858123B (en) Fault root cause analysis method and device based on directed graph network
US8655623B2 (en) Diagnostic system and method
US10177984B2 (en) Isolation of problems in a virtual environment
US9122784B2 (en) Isolation of problems in a virtual environment
US20120054554A1 (en) Problem isolation in a virtual environment
CN110716842B (en) Cluster fault detection method and device
US11144376B2 (en) Veto-based model for measuring product health
CN103746829A (en) Cluster-based fault perception system and method thereof
CN111611100B (en) Transaction fault detection method, device, computing equipment and medium
CN114465874B (en) Fault prediction method, device, electronic equipment and storage medium
CN112769605B (en) Heterogeneous multi-cloud operation and maintenance management method and hybrid cloud platform
CN112559237B (en) Operation and maintenance system troubleshooting method and device, server and storage medium
CN111859047A (en) Fault solving method and device
WO2017039506A1 (en) Method and network node for localizing a fault causing performance degradation of service
CN116166505B (en) Monitoring platform, method, storage medium and equipment for dual-state IT architecture in financial industry
CN115237717A (en) Micro-service abnormity detection method and system
CN116089224B (en) Alarm analysis method, alarm analysis device, calculation node and computer readable storage medium
CN108108445A (en) A kind of data intelligence processing method and system
CN111913824B (en) Method for determining data link fault cause and related equipment
US20230105304A1 (en) Proactive avoidance of performance issues in computing environments
CN113516174A (en) Call chain abnormality detection method, computer device, and readable storage medium
CN113537337A (en) Training method, abnormality detection method, apparatus, device, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination