CN111796956A - Distributed system fault diagnosis method, device, equipment and storage medium - Google Patents

Distributed system fault diagnosis method, device, equipment and storage medium

Info

Publication number
CN111796956A
CN111796956A (application CN202010575458.5A)
Authority
CN
China
Prior art keywords
distributed system
subsystems
message exchange
fault diagnosis
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010575458.5A
Other languages
Chinese (zh)
Inventor
刘利
刘中原
牛姣姣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
OneConnect Smart Technology Co Ltd
OneConnect Financial Technology Co Ltd Shanghai
Original Assignee
OneConnect Financial Technology Co Ltd Shanghai
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by OneConnect Financial Technology Co Ltd Shanghai filed Critical OneConnect Financial Technology Co Ltd Shanghai
Priority to CN202010575458.5A priority Critical patent/CN111796956A/en
Publication of CN111796956A publication Critical patent/CN111796956A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0709Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/008Reliability or availability analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0751Error or fault detection not based on redundancy
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079Root cause analysis, i.e. error or fault diagnosis
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06Management of faults, events, alarms or notifications
    • H04L41/0631Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00Arrangements for monitoring or testing data switching networks
    • H04L43/08Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/0805Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability
    • H04L43/0811Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability by checking connectivity

Abstract

The application relates to the field of software monitoring, and particularly discloses a distributed system fault diagnosis method, device, equipment and storage medium, wherein the method comprises the following steps: receiving abnormal information sent by the distributed system; acquiring message exchange events among subsystems in the distributed system based on the abnormal information, and drawing a causal graph based on the message exchange events; constructing a diagnostic tree based on the causal graph; performing diagnostic tests on the subsystems in the distributed system to obtain estimation parameters; and calculating the alarm propagation path probabilities of the subsystems in the distributed system based on the diagnostic tree and the estimation parameters, so as to perform fault probability diagnosis on the distributed system. The method diagnoses faults occurring in the distributed system, identifies the source of the faults, and improves fault detection efficiency.

Description

Distributed system fault diagnosis method, device, equipment and storage medium
Technical Field
The present application relates to the field of fault diagnosis, and in particular, to a distributed system fault diagnosis method, apparatus, device, and storage medium.
Background
Currently, with the continued development of distributed computing infrastructure, distributed multi-tier applications underpin systems ranging from ATM networks to e-commerce web sites, and the consequences of distributed system downtime can be catastrophic.
However, due to the complex structure of a distributed system, error-reporting alarms propagate quickly, and many applications are black boxes with high latency from the diagnostician's perspective. A fault in one subsystem may therefore cause multiple monitoring indices to become abnormal and a large number of tests to fail, making it difficult to quickly locate the fault and determine its cause.
Therefore, how to diagnose faults occurring in a distributed system and identify their source has become an urgent problem to be solved.
Disclosure of Invention
The application provides a distributed system fault diagnosis method, apparatus, computer device and storage medium, which are used for performing fault diagnosis on faults occurring in a distributed system and identifying the source of the faults.
In a first aspect, the present application provides a distributed system fault diagnosis method, including:
receiving abnormal information sent by the distributed system;
acquiring message exchange events among subsystems in the distributed system based on the system abnormal information, and drawing a cause-and-effect graph based on the message exchange events;
constructing a diagnostic tree based on the causal graph;
performing diagnostic tests on subsystems in the distributed system to obtain estimated parameters;
and calculating the alarm propagation path probability of the subsystem in the distributed system based on the diagnosis tree and the estimation parameters so as to carry out fault probability diagnosis on the distributed system.
In a second aspect, the present application further provides a distributed system fault diagnosis apparatus, including:
the exception receiving module is used for receiving abnormal information sent by the distributed system;
the event acquisition module is used for acquiring message exchange events among subsystems in the distributed system based on the system abnormal information and drawing a cause-and-effect graph based on the message exchange events;
a diagnostic tree construction module for constructing a diagnostic tree based on the causal graph;
the diagnostic test module is used for carrying out diagnostic test on the subsystems in the distributed system to obtain estimation parameters;
and the probability diagnosis module is used for calculating the alarm propagation path probability of the subsystems in the distributed system based on the diagnosis tree and the estimation parameters so as to carry out fault probability diagnosis on the distributed system.
In a third aspect, the present application further provides a computer device comprising a memory and a processor; the memory is used for storing a computer program; the processor is used for executing the computer program and realizing the distributed system fault diagnosis method when the computer program is executed.
In a fourth aspect, the present application also provides a computer-readable storage medium storing a computer program, which when executed by a processor causes the processor to implement the distributed system fault diagnosis method as described above.
The application discloses a distributed system fault diagnosis method, device, equipment and storage medium. Abnormal information sent by the distributed system is received; message exchange events among subsystems in the distributed system are acquired based on the abnormal information, and a causal graph is drawn based on the message exchange events; a diagnostic tree is constructed based on the causal graph; diagnostic tests are performed on the subsystems in the distributed system to obtain estimation parameters; and the alarm propagation path probabilities of the subsystems are calculated based on the diagnostic tree and the estimation parameters, so as to perform fault probability diagnosis on the distributed system. Calculating the alarm propagation path probability of each subsystem from the constructed diagnostic tree makes it possible to diagnose faults occurring in the distributed system and improves the speed of fault diagnosis.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; other drawings can be obtained from them by those skilled in the art without creative effort.
Fig. 1 is a schematic flow chart of a distributed system fault diagnosis method provided in an embodiment of the present application;
FIG. 2 is a schematic diagram illustrating steps for creating a cause and effect graph based on the message exchange events according to an embodiment of the present application;
FIG. 3 is a causal diagram illustration of a subsystem in a distributed system as provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of a diagnostic tree of a subsystem in a distributed system according to an embodiment of the present disclosure;
FIG. 5 is a flow chart illustrating sub-steps of a distributed system fault diagnosis method provided by an embodiment of the present application;
FIG. 6 is a schematic diagram illustrating steps for calculating an alarm propagation path probability of a sub-system according to an embodiment of the present application;
fig. 7 is a schematic block diagram of a distributed system fault diagnosis apparatus according to an embodiment of the present application;
fig. 8 is a schematic block diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The flow diagrams depicted in the figures are merely illustrative and do not necessarily include all of the elements and operations/steps, nor do they necessarily have to be performed in the order depicted. For example, some operations/steps may be decomposed, combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
It is to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
The embodiment of the application provides a distributed system fault diagnosis method and device, computer equipment and a storage medium. The distributed system fault diagnosis method can be used for carrying out fault diagnosis on the distributed system, and the fault diagnosis speed is improved.
Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.
Referring to fig. 1, fig. 1 is a schematic flowchart of a distributed system fault diagnosis method according to an embodiment of the present application. The fault diagnosis method of the distributed system realizes fault probability detection in the distributed system by constructing a diagnosis tree and calculating the probability of a propagation path.
As shown in fig. 1, the distributed system fault diagnosis method specifically includes: step S101 to step S105.
S101, receiving abnormal information sent by the distributed system.
A monitor monitors the running state of the distributed system. When a subsystem or middleware of the distributed system fails, it throws an exception message; likewise, when an upstream system goes down, the downstream system cannot receive its messages and throws an exception message. The monitor receives the messages thrown by each subsystem; when an abnormal-message identifier is found in a message, the occurrence of a fault has been detected.
In particular, the monitor may obtain messages by modifying a middleware layer in the distributed system to forward them, or may capture protocol messages through its passive listening mechanism.
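As a rough illustration of this fault-detection step, the following minimal Python sketch flags messages carrying an abnormal-message identifier; the message layout and the "EXC" marker are assumptions made for illustration, not part of the disclosure.
```python
# Minimal sketch of the monitor's fault check (message layout and the
# "EXC" identifier are assumed for illustration).

EXCEPTION_MARKER = "EXC"  # assumed identifier carried by abnormal messages

def detect_faults(messages):
    """Return the messages that carry the abnormal-message identifier."""
    return [m for m in messages if m.get("type") == EXCEPTION_MARKER]

# Usage: diagnosis (steps S102-S105) starts when this list is non-empty.
faults = detect_faults([{"type": "EXC", "source": "subsystem-B"},
                        {"type": "OK", "source": "subsystem-A"}])
print(faults)  # [{'type': 'EXC', 'source': 'subsystem-B'}]
```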
S102, acquiring a message exchange event among subsystems in the distributed system based on the system abnormal information, and drawing a cause-and-effect graph based on the message exchange event.
Specifically, the monitor acquires message exchange events among the subsystems in the distributed system, and draws a cause-and-effect graph according to the acquired message exchange events.
In some embodiments, as shown in fig. 2, the drawing of the cause-and-effect graph based on the message exchange event specifically includes step S1021 and step S1022.
S1021, determining the occurrence sequence of the message exchange events.
Because the communication connections are asynchronous, the order in which the monitor receives message exchange events may differ from the order in which they actually occurred in the subsystems of the distributed system; the actual order of occurrence of the message exchange events therefore needs to be determined.
In some embodiments, the determining the order of occurrence of the message exchange events comprises: acquiring an event timestamp of the message exchange event; and determining the occurrence sequence of the message exchange events according to the event time stamps.
Specifically, for two different subsystems in the distributed system, the order in which the monitor receives their message exchange events may differ from the order in which the subsystems sent them. To recover the correct order of the message exchange events sent by multiple systems, and hence the correct causal relationships for drawing the causal graph, each subsystem attaches its own timestamp information when it sends a message, so that the monitor can determine the true order from the timestamp (i.e., the logical clock, LC) attached to each message.
For example, consider three subsystems A, B and D, where A sends a message to D first and B sends a message to D later. Because the messages from A and B reach the monitor via different paths, the times the monitor records on receiving the message exchange events may differ from the times at which A and B actually sent the messages, and the apparent sending order may change. Recovering the true order of the message exchange events between subsystems from the timestamps, and drawing the causal graph in that order, improves the accuracy of the resulting graph.
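The ordering step S1021 can be illustrated with the following Python sketch; the ExchangeEvent layout is an assumption, and ties between different senders are broken arbitrarily, which yields one admissible linearization.
```python
# Sketch of step S1021: restore the occurrence order of message exchange
# events from the logical-clock (LC) timestamps attached by the senders.

from typing import List, NamedTuple

class ExchangeEvent(NamedTuple):
    sender: str    # subsystem that sent the message, e.g. "B"
    receiver: str  # subsystem that received it, e.g. "D"
    lc: int        # sender's logical clock value at send time

def order_events(events: List[ExchangeEvent]) -> List[ExchangeEvent]:
    """Return events ordered consistently with each sender's logical clock.

    Events from the same sender are ordered by LC; ties across senders are
    broken by sender name, which is one admissible linearization.
    """
    return sorted(events, key=lambda e: (e.lc, e.sender))

# Mirroring the example above: B's LC2 message precedes B's LC4 message
# regardless of the order in which the monitor happened to receive them.
events = [ExchangeEvent("B", "D", 4), ExchangeEvent("A", "D", 1),
          ExchangeEvent("B", "D", 2)]
print(order_events(events))
```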
And S1022, sequentially connecting the subsystems in the distributed system according to the occurrence sequence to obtain a causal graph.
The nodes of the causal graph are subsystems in a distributed system, and the connecting edge between two nodes in the causal graph is a message exchange event between the two subsystems, namely message transmission between the two subsystems.
And sequentially connecting the subsystems in the distributed system according to the occurrence sequence of the message exchange events among the subsystems to finally obtain the cause-effect diagram.
For example, as shown in fig. 3, a causal graph among four subsystems A, B, C and D is drawn, where A, B, C and D are represented as four nodes in the graph. The numbers on the edges represent the order of the messages between subsystems in the distributed system. Message "6" carries the sender timestamp B.LC4, while message "2" carries the timestamp B.LC2, so message "2" must precede message "6".
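Step S1022 can then be sketched as follows, reusing the ExchangeEvent type from the previous sketch; the adjacency-list representation of the causal graph is an assumption.
```python
# Sketch of step S1022: connect subsystems in occurrence order to form the
# causal graph. Nodes are subsystems; each directed edge is one message
# exchange event, labelled with its position in the restored order.

def build_causal_graph(ordered_events):
    """Return {node: [(receiver, seq_no), ...]} from events ordered by S1021."""
    graph = {}
    for seq_no, event in enumerate(ordered_events, start=1):
        graph.setdefault(event.sender, []).append((event.receiver, seq_no))
        graph.setdefault(event.receiver, [])  # ensure every subsystem appears
    return graph

# E.g. A->D then B->D gives {'A': [('D', 1)], 'D': [], 'B': [('D', 2)]}.
```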
S103, constructing a diagnosis tree based on the causal graph.
The diagnostic tree is constructed according to the causal graph, specifically according to the message exchange events among the nodes in the causal graph and the timestamps corresponding to the message exchange events.
Specifically, a root node is determined based on the causal graph, where the root node is the subsystem that finally receives the message exchange event information sent by the other subsystems; the nodes that directly send messages to the root node become its child nodes; and recursion proceeds in this way until no message with a causal relationship remains in the causal graph, at which point the resulting tree is taken as the diagnostic tree. The same subsystem may appear multiple times at different depths in the diagnostic tree, since it may exchange messages with different subsystems at different times.
For example, taking the causal graph constructed in step S102, the root node of the diagnostic tree is D, and the nodes that directly send messages to node D are located at depth 1. Recursively, the nodes at depth i consist of all nodes that send messages to nodes at depth i-1; the resulting diagnostic tree is shown in FIG. 4.
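The unrolling of the causal graph into a diagnostic tree might be sketched as follows, again reusing the event representation from the earlier sketches; consuming each message (causal edge) exactly once is a simplifying assumption.
```python
# Sketch of step S103: unroll the causal graph into a diagnostic tree.
# The root is the subsystem that ultimately received the alarm; depth i
# holds every node that sent a message to a node at depth i-1. A subsystem
# can recur at several depths, so each message (causal edge) is consumed
# once, and the recursion stops when no causal edge remains.

def build_diagnostic_tree(ordered_events, root):
    """Return the tree as {(node, depth): [children at depth + 1]}."""
    remaining = list(ordered_events)  # causal edges not yet placed in the tree
    tree = {}
    frontier, depth = [root], 0
    while frontier and remaining:
        next_frontier = []
        for node in frontier:
            # Senders of still-unused messages into `node` become its children.
            children = [e.sender for e in remaining if e.receiver == node]
            remaining = [e for e in remaining if e.receiver != node]
            tree[(node, depth)] = children
            next_frontier.extend(children)
        frontier, depth = next_frontier, depth + 1
    return tree
```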
And S104, carrying out diagnostic test on the subsystems in the distributed system to obtain an estimation parameter.
The estimation parameters include a node reliability value, a connection reliability value, and an EMC (error masking capability) value. The EMC value is a preset value, and each subsystem corresponds to one EMC value. Specifically, a diagnostic test may be performed on each subsystem in the distributed system to obtain the estimation parameters corresponding to that subsystem. In a specific implementation, a causal test may be used to perform the diagnostic tests on the subsystems in the distributed system.
In some embodiments, the method may further comprise: and storing the estimation parameters corresponding to each subsystem into an estimation parameter table so as to facilitate data calling when the alarm propagation path probability is calculated, and improve the calculation speed and efficiency.
In some embodiments, as shown in fig. 5, step S104 specifically includes step S1041 and step S1042.
S1041, testing the subsystems in the distributed system by using the plurality of test messages to obtain a plurality of test results.
Specifically, each subsystem in the distributed system may be tested separately using multiple test messages, so as to obtain the estimation parameters of each subsystem. The test messages may be of different types, for example send, receive, or mixed, and may carry different content.
In a specific implementation, the format of a test message may be represented as <Type> <State1> <Event1> <Count1> <State2> <Event2> <Count2>, where Type represents the message type of the test message and a triple <State> <Event> <Count> means that event E is detected in state S at least C times. <State1> <Event1> <Count1> records the subsystem's actual message receiving or sending behaviour, while <State2> <Event2> <Count2> records the receiving or sending behaviour expected when the subsystem is abnormal.
For each test message, <State1> <Event1> <Count1> is compared with <State2> <Event2> <Count2> to obtain the corresponding test result, which is one of two conditions: the subsystem shows no abnormality, or the subsystem is abnormal.
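A minimal sketch of evaluating one test message in this format follows; the tuple representation, and treating a match with the abnormal pattern as a failed test, are assumptions made for illustration.
```python
# Sketch of step S1041's per-message check for test messages in the
# <Type><State1><Event1><Count1><State2><Event2><Count2> format.

from collections import namedtuple

TestMessage = namedtuple("TestMessage", "type s1 e1 c1 s2 e2 c2")

def no_abnormality(msg: "TestMessage") -> bool:
    """True if the subsystem showed no abnormality for this test message.

    (s1, e1, c1) is the subsystem's actual receive/send behaviour;
    (s2, e2, c2) is the behaviour expected when the subsystem is abnormal.
    Matching the abnormal pattern marks the test as failed.
    """
    return (msg.s1, msg.e1, msg.c1) != (msg.s2, msg.e2, msg.c2)
```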
S1042, counting the plurality of test results to calculate estimation parameters according to the statistical results.
Each test message corresponds to one test result. The test results are therefore counted to obtain the total number of test results (i.e., the total number of tests) and the number of test results in which the subsystem showed no abnormality, and the estimation parameters are then calculated from these statistics.
In a specific implementation process, the following formula can be adopted for calculating the node reliability value according to the statistical result:
nr = |R'| / |R|
wherein nr represents the node reliability value, |R| represents the total number of tests, and |R'| represents the number of test results in which the subsystem showed no abnormality. The larger the value of nr, the more reliable the node.
The following formula can be used for calculating the connection reliability value according to the statistical result:
lr_(i,j) = r_recv / r_sent
wherein lr_(i,j) represents the connection reliability value between node n_i and node n_j, r_recv represents the number of test messages received by the node receiving the messages, and r_sent represents the number of test messages sent by the node sending the messages. The larger the value of lr_(i,j), the more reliable the connection between node n_i and node n_j.
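Under the two formulas above, the calculation of step S1042 might be sketched as follows (representing each test result as a boolean is an assumption):
```python
# Sketch of step S1042: compute the estimation parameters from the
# statistics, using nr = |R'| / |R| and lr(i,j) = r_recv / r_sent above.

def node_reliability(results):
    """results: one boolean per test, True = no abnormality observed."""
    return sum(results) / len(results)  # |R'| / |R|

def connection_reliability(received: int, sent: int) -> float:
    """received: test messages seen by the receiver; sent: sent by the sender."""
    return received / sent  # r_recv / r_sent

print(node_reliability([True, True, False, True]))  # 0.75
print(connection_reliability(9, 10))                # 0.9
```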
And S105, calculating the alarm propagation path probability of the subsystem in the distributed system based on the diagnosis tree and the estimation parameters so as to carry out fault probability diagnosis on the distributed system.
The nodes in the diagnostic tree represent subsystems in the distributed system; that is, an alarm propagation path in the distributed system is a path from some node in the diagnostic tree to the root node. The alarm propagation path probability of each subsystem in the distributed system is calculated based on the diagnostic tree and the estimation parameters, so as to perform probability diagnosis on each subsystem.
In some embodiments, as shown in fig. 6, calculating the alarm propagation path probability of the subsystem specifically includes step S1051 and step S1052.
S1051, traversing each path connecting to the root node in the diagnostic tree to obtain a plurality of paths to be tested.
Specifically, the diagnostic tree is traversed to obtain the paths that connect to the root node of the diagnostic tree, and the obtained paths are taken as the paths to be tested.
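The traversal might be sketched as follows, assuming the {(node, depth): children} tree layout of the earlier tree-construction sketch:
```python
# Sketch of step S1051: enumerate every path in the diagnostic tree that
# connects a node to the root; each branch yields one path to be tested.

def paths_to_root(tree, root):
    """Return candidate paths as [n_1, ..., n_k] with n_k the root."""
    paths = []

    def walk(node, depth, suffix):
        children = tree.get((node, depth), [])
        if not children:                      # leaf: emit the completed path
            paths.append(list(reversed(suffix)))
        for child in children:
            walk(child, depth + 1, suffix + [child])

    walk(root, 0, [root])
    return paths
```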
S1052, obtaining the estimation parameters corresponding to each node in each path to be tested, and calculating the alarm propagation path probability corresponding to each path to be tested according to a preset calculation formula.
Specifically, the preset calculation formula may specifically be:
PPEP(n_1, n_k) = (1 - nr(n_1)) · lr_(1,2) · (1 - emc(n_2)) · … · (1 - nr(n_i)) · lr_(i,i+1) · (1 - emc(n_(i+1))) · … · (1 - nr(n_(k-2))) · lr_(k-2,k-1) · (1 - emc(n_(k-1))) · lr_(k-1,k)
wherein PPEP(n_1, n_k) represents the probability that node n_1 fails, the failure alarm propagates along the path from n_1 to n_k, and n_k consequently also fails; nr(n_i) represents the node reliability value of node n_i; lr_(i,j) represents the connection reliability value between node n_i and node n_j; and emc(n_i) represents the node EMC value of node n_i.
For each path to be tested, the estimation parameters of each node in the path are obtained, and the alarm propagation path probability of the path is calculated using the preset calculation formula. The paths with a higher probability of having caused the failure are screened out by their alarm propagation path probabilities, and fault detection is then performed on the nodes in the screened paths, which speeds up fault diagnosis.
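Putting the preset formula to work, the following sketch scores one candidate path with PPEP; the dictionary layout of the estimation-parameter tables is an assumption.
```python
# Sketch of step S1052: score a path [n_1, ..., n_k] (k >= 2) with the PPEP
# formula above. nr and emc map node -> value; lr maps (node_i, node_j) -> value.

def ppep(path, nr, lr, emc):
    k = len(path)
    p = 1.0
    for i in range(k - 2):  # product terms for n_1 .. n_{k-2}
        a, b = path[i], path[i + 1]
        p *= (1.0 - nr[a]) * lr[(a, b)] * (1.0 - emc[b])
    p *= lr[(path[-2], path[-1])]  # final hop into the root n_k
    return p

# Paths can then be ranked by PPEP and the highest-probability ones
# inspected first, as described above.
```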
In the distributed system fault diagnosis method provided by the above embodiment, abnormal information sent by the distributed system is received; message exchange events among subsystems in the distributed system are acquired based on the abnormal information, and a causal graph is drawn based on the message exchange events; a diagnostic tree is constructed based on the causal graph; diagnostic tests are performed on the subsystems to obtain estimation parameters; and the alarm propagation path probabilities of the subsystems are calculated based on the diagnostic tree and the estimation parameters, so as to perform fault probability diagnosis on the distributed system. Calculating the alarm propagation path probability of each subsystem from the constructed diagnostic tree enables fault diagnosis of faults occurring in the distributed system and improves the speed of fault diagnosis.
Referring to fig. 7, fig. 7 is a schematic block diagram of a distributed system fault diagnosis apparatus according to an embodiment of the present application, where the distributed system fault diagnosis apparatus is configured to perform the foregoing distributed system fault diagnosis method. The distributed system fault diagnosis device may be configured in a server or a terminal.
The server may be an independent server or a server cluster. The terminal can be an electronic device such as a mobile phone, a tablet computer, a notebook computer, a desktop computer, a personal digital assistant and a wearable device.
As shown in fig. 7, the distributed system fault diagnosis apparatus 200 includes: an exception receiving module 201, an event obtaining module 202, a diagnostic tree construction module 203, a diagnostic test module 204 and a probability diagnosis module 205.
An exception receiving module 201, configured to receive exception information sent by the distributed system.
An event obtaining module 202, configured to obtain, based on the system anomaly information, a message exchange event between subsystems in the distributed system, and draw a cause-and-effect graph based on the message exchange event.
The event obtaining module 202 includes an order determining sub-module 2021 and a causal graph sub-module 2022.
In particular, the order determination sub-module 2021 is configured to determine an occurrence order of the message exchange events. The cause and effect diagram sub-module 2022 is configured to sequentially connect the subsystems in the distributed system according to the occurrence order to obtain a cause and effect diagram.
A diagnostic tree construction module 203 for constructing a diagnostic tree based on the causal graph.
And the diagnostic test module 204 is configured to perform diagnostic tests on subsystems in the distributed system to obtain the estimated parameters.
The diagnostic test module 204 includes a test result sub-module 2041 and a result statistic sub-module 2042.
Specifically, the test result sub-module 2041 is configured to test the subsystems in the distributed system by using multiple test messages, so as to obtain multiple test results. The result statistic submodule 2042 is configured to perform statistics on the multiple test results, so as to calculate an estimation parameter according to the statistical result.
And a probability diagnosis module 205, configured to calculate an alarm propagation path probability of the subsystem in the distributed system based on the diagnosis tree and the estimation parameter, so as to perform fault probability diagnosis on the distributed system.
The probabilistic diagnostic module 205 includes a path traversal sub-module 2051 and a probability computation sub-module 2052.
Specifically, the path traversing sub-module 2051 is configured to traverse each path connecting the root nodes in the diagnostic tree to obtain multiple paths to be measured. And the probability calculation submodule 2052 is configured to obtain an estimation parameter corresponding to each node in the path to be measured, and calculate the probability of the alarm propagation path corresponding to the path to be measured according to a preset calculation formula.
It should be noted that, as will be clearly understood by those skilled in the art, for convenience and brevity of description, the specific working processes of the distributed system fault diagnosis apparatus and each module described above may refer to the corresponding processes in the foregoing distributed system fault diagnosis method embodiment, and are not described herein again.
The distributed system fault diagnosis apparatus described above may be implemented in the form of a computer program that can be run on a computer device as shown in fig. 8.
Referring to fig. 8, fig. 8 is a schematic block diagram of a computer device according to an embodiment of the present disclosure. The computer device may be a server or a terminal.
Referring to fig. 8, the computer device includes a processor, a memory, and a network interface connected through a system bus, wherein the memory may include a nonvolatile storage medium and an internal memory.
The non-volatile storage medium may store an operating system and a computer program. The computer program includes program instructions that, when executed, cause a processor to perform any one of the distributed system fault diagnosis methods.
The processor is used for providing calculation and control capability and supporting the operation of the whole computer equipment.
The internal memory provides an environment for the execution of a computer program on a non-volatile storage medium, which when executed by a processor, causes the processor to perform any one of the distributed system fault diagnosis methods.
The network interface is used for network communication, such as sending assigned tasks and the like. Those skilled in the art will appreciate that the architecture shown in fig. 8 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
It should be understood that the Processor may be a Central Processing Unit (CPU), and the Processor may be other general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, etc. Wherein a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
Wherein, in one embodiment, the processor is configured to execute a computer program stored in the memory to implement the steps of:
receiving abnormal information sent by the distributed system; acquiring message exchange events among subsystems in the distributed system based on the system abnormal information, and drawing a cause-and-effect graph based on the message exchange events; constructing a diagnostic tree based on the causal graph; performing diagnostic tests on subsystems in the distributed system to obtain estimated parameters; and calculating the alarm propagation path probability of the subsystem in the distributed system based on the diagnosis tree and the estimation parameters so as to carry out fault probability diagnosis on the distributed system.
In one embodiment, the processor, in implementing the rendering of a cause and effect graph based on the message exchange events, is configured to implement:
determining an order of occurrence of the message exchange events; and sequentially connecting the subsystems in the distributed system according to the occurrence sequence to obtain a cause-and-effect graph, wherein the nodes of the cause-and-effect graph are the subsystems in the distributed system, and the connecting edge between two nodes in the cause-and-effect graph is a message exchange event between the two subsystems.
In one embodiment, the processor, when implementing the determining the order of occurrence of the message exchange events, is configured to implement:
acquiring an event timestamp of the message exchange event; and determining the occurrence sequence of the message exchange events according to the event time stamps.
In one embodiment, the processor, in performing the diagnostic test on the subsystems in the distributed system to obtain the estimated parameters, is configured to perform:
testing subsystems in the distributed system by using a plurality of test messages to obtain a plurality of test results; and counting a plurality of test results to calculate an estimation parameter according to the statistical result.
In one embodiment, the processor, in implementing the calculating of the alarm propagation path probability for the subsystem in the distributed system based on the diagnostic tree and the estimated parameters, is configured to implement:
traversing each path connecting to the root node in the diagnostic tree to obtain a plurality of paths to be tested; and acquiring the estimation parameters corresponding to each node in each path to be tested, and calculating the alarm propagation path probability corresponding to each path to be tested according to a preset calculation formula.
In one embodiment, the estimated parameters include a node reliability value, a connection reliability value, and an EMC value, and the preset calculation formula is:
PPEP(n_1, n_k) = (1 - nr(n_1)) · lr_(1,2) · (1 - emc(n_2)) · … · (1 - nr(n_i)) · lr_(i,i+1) · (1 - emc(n_(i+1))) · … · (1 - nr(n_(k-2))) · lr_(k-2,k-1) · (1 - emc(n_(k-1))) · lr_(k-1,k)
wherein PPEP(n_1, n_k) represents the alarm propagation path probability, nr(n_i) represents the node reliability value of node n_i, lr_(i,j) represents the connection reliability value between node n_i and node n_j, and emc(n_i) represents the node EMC value of node n_i.
The embodiment of the application further provides a computer-readable storage medium, wherein a computer program is stored in the computer-readable storage medium, the computer program comprises program instructions, and the processor executes the program instructions to implement any one of the distributed system fault diagnosis methods provided by the embodiment of the application.
The computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiment, for example, a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the computer device.
While the invention has been described with reference to specific embodiments, the scope of the invention is not limited thereto, and those skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A distributed system fault diagnosis method is characterized by comprising the following steps:
receiving abnormal information sent by the distributed system;
acquiring message exchange events among subsystems in the distributed system based on the system abnormal information, and drawing a cause-and-effect graph based on the message exchange events;
constructing a diagnostic tree based on the causal graph;
performing diagnostic tests on subsystems in the distributed system to obtain estimated parameters;
and calculating the alarm propagation path probability of the subsystem in the distributed system based on the diagnosis tree and the estimation parameters so as to carry out fault probability diagnosis on the distributed system.
2. The distributed system fault diagnosis method of claim 1, wherein said plotting a cause and effect graph based on said message exchange events comprises:
determining an order of occurrence of the message exchange events;
and sequentially connecting the subsystems in the distributed system according to the occurrence sequence to obtain a cause-and-effect graph, wherein the nodes of the cause-and-effect graph are the subsystems in the distributed system, and the connecting edge between two nodes in the cause-and-effect graph is a message exchange event between the two subsystems.
3. The distributed system fault diagnosis method according to claim 2, wherein the determining the occurrence order of the message exchange events comprises:
acquiring an event timestamp of the message exchange event;
and determining the occurrence sequence of the message exchange events according to the event time stamps.
4. The method of claim 1, wherein the performing diagnostic tests on subsystems in the distributed system to obtain estimated parameters comprises:
testing subsystems in the distributed system by using a plurality of test messages to obtain a plurality of test results;
and counting a plurality of test results to calculate an estimation parameter according to the statistical result.
5. The distributed system fault diagnosis method of claim 1, wherein the calculating of the alarm propagation path probability for the sub-system in the distributed system based on the diagnostic tree and the estimated parameters comprises:
traversing each path connecting to the root node in the diagnostic tree to obtain a plurality of paths to be tested;
and acquiring the estimation parameters corresponding to each node in each path to be tested, and calculating the alarm propagation path probability corresponding to each path to be tested according to a preset calculation formula.
6. The distributed system fault diagnosis method according to claim 5, wherein the estimated parameters include node reliability values, connection reliability values and EMC values, and the preset calculation formula is:
PPEP(n_1, n_k) = (1 - nr(n_1)) · lr_(1,2) · (1 - emc(n_2)) · … · (1 - nr(n_i)) · lr_(i,i+1) · (1 - emc(n_(i+1))) · … · (1 - nr(n_(k-2))) · lr_(k-2,k-1) · (1 - emc(n_(k-1))) · lr_(k-1,k)
wherein PPEP(n_1, n_k) represents the alarm propagation path probability, nr(n_i) represents the node reliability value of node n_i, lr_(i,j) represents the connection reliability value between node n_i and node n_j, and emc(n_i) represents the node EMC value of node n_i.
7. The distributed system fault diagnosis method according to claim 1, wherein the method further comprises:
and storing the estimation parameters corresponding to each subsystem into an estimation parameter table.
8. A distributed system fault diagnosis apparatus, comprising:
the exception receiving module is used for receiving abnormal information sent by the distributed system;
the event acquisition module is used for acquiring message exchange events among subsystems in the distributed system based on the system abnormal information and drawing a cause-and-effect graph based on the message exchange events;
a diagnostic tree construction module for constructing a diagnostic tree based on the causal graph;
the diagnostic test module is used for carrying out diagnostic test on the subsystems in the distributed system to obtain estimation parameters;
and the probability diagnosis module is used for calculating the alarm propagation path probability of the subsystems in the distributed system based on the diagnosis tree and the estimation parameters so as to carry out fault probability diagnosis on the distributed system.
9. A computer device, wherein the computer device comprises a memory and a processor;
the memory is used for storing a computer program;
the processor for executing the computer program and implementing the distributed system fault diagnosis method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to implement the distributed system fault diagnosis method according to any one of claims 1 to 7.
CN202010575458.5A 2020-06-22 2020-06-22 Distributed system fault diagnosis method, device, equipment and storage medium Pending CN111796956A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010575458.5A CN111796956A (en) 2020-06-22 2020-06-22 Distributed system fault diagnosis method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010575458.5A CN111796956A (en) 2020-06-22 2020-06-22 Distributed system fault diagnosis method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111796956A true CN111796956A (en) 2020-10-20

Family

ID=72803635

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010575458.5A Pending CN111796956A (en) 2020-06-22 2020-06-22 Distributed system fault diagnosis method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111796956A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113780614A (en) * 2021-01-04 2021-12-10 北京沃东天骏信息技术有限公司 Risk identification method and device
CN112365344A (en) * 2021-01-11 2021-02-12 支付宝(杭州)信息技术有限公司 Method and system for automatically generating business rules
CN112365344B (en) * 2021-01-11 2021-04-27 支付宝(杭州)信息技术有限公司 Method and system for automatically generating business rules
CN114200899A (en) * 2021-11-16 2022-03-18 中国航空工业集团公司雷华电子技术研究所 Multi-subsystem control method and system electronic equipment and readable storage medium thereof
CN114579380A (en) * 2022-03-02 2022-06-03 淮北仕欧网络科技有限公司 Artificial intelligence detection system and method for computer system fault
CN114579380B (en) * 2022-03-02 2023-10-17 浙江中国小商品城集团股份有限公司 Artificial intelligence detection system and method for computer system faults
CN116048859A (en) * 2023-01-28 2023-05-02 金篆信科有限责任公司 Distributed database fault diagnosis method and device, electronic equipment and storage medium
CN116048859B (en) * 2023-01-28 2023-08-25 金篆信科有限责任公司 Distributed database fault diagnosis method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111796956A (en) Distributed system fault diagnosis method, device, equipment and storage medium
US11321160B2 (en) In a microservices-based application, mapping distributed error stacks across multiple dimensions
US11010235B1 (en) Tracking error propagation across microservices based applications using distributed error stacks
US11755446B1 (en) Application topology graph for representing uninstrumented objects in a microservices-based architecture
US8230262B2 (en) Method and apparatus for dealing with accumulative behavior of some system observations in a time series for Bayesian inference with a static Bayesian network model
US8156377B2 (en) Method and apparatus for determining ranked causal paths for faults in a complex multi-host system with probabilistic inference in a time series
US8069370B1 (en) Fault identification of multi-host complex systems with timesliding window analysis in a time series
US8291263B2 (en) Methods and apparatus for cross-host diagnosis of complex multi-host systems in a time series with probabilistic inference
US7711987B2 (en) System and method for problem determination using dependency graphs and run-time behavior models
US8667334B2 (en) Problem isolation in a virtual environment
US11526425B1 (en) Generating metric data streams from spans ingested by a cloud deployment of an instrumentation analytics engine
US11429574B2 (en) Computer system diagnostic log chain
US20150046757A1 (en) Performance Metrics of a Computer System
JP7081741B2 (en) Methods and devices for determining the status of network devices
Kobayashi et al. Mining causes of network events in log data with causal inference
CN115118621B (en) Dependency graph-based micro-service performance diagnosis method and system
US11516269B1 (en) Application performance monitoring (APM) detectors for flagging application performance alerts
US9348721B2 (en) Diagnosing entities associated with software components
US8972789B2 (en) Diagnostic systems for distributed network
US11789804B1 (en) Identifying the root cause of failure observed in connection to a workflow
CN113454950A (en) Network equipment and link real-time fault detection method and system based on flow statistics
Kavulya et al. Draco: Top Down Statistical Diagnosis of Large-Scale VoIP Networks
CN116057902A (en) Health index of service
US9311210B1 (en) Methods and apparatus for fault detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination