CN111796956A - Distributed system fault diagnosis method, device, equipment and storage medium - Google Patents

Distributed system fault diagnosis method, device, equipment and storage medium

Info

Publication number
CN111796956A
CN111796956A (application CN202010575458.5A)
Authority
CN
China
Prior art keywords
distributed system
subsystems
message exchange
fault diagnosis
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010575458.5A
Other languages
Chinese (zh)
Inventor
刘利
刘中原
牛姣姣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
OneConnect Smart Technology Co Ltd
OneConnect Financial Technology Co Ltd Shanghai
Original Assignee
OneConnect Financial Technology Co Ltd Shanghai
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by OneConnect Financial Technology Co Ltd Shanghai filed Critical OneConnect Financial Technology Co Ltd Shanghai
Priority to CN202010575458.5A priority Critical patent/CN111796956A/en
Publication of CN111796956A publication Critical patent/CN111796956A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0709Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/008Reliability or availability analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0751Error or fault detection not based on redundancy
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079Root cause analysis, i.e. error or fault diagnosis
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06Management of faults, events, alarms or notifications
    • H04L41/0631Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00Arrangements for monitoring or testing data switching networks
    • H04L43/08Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/0805Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability
    • H04L43/0811Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability by checking connectivity

Abstract

The application relates to the field of software monitoring, and particularly discloses a distributed system fault diagnosis method, device, equipment and storage medium, wherein the method comprises the following steps: receiving abnormal information sent by the distributed system; acquiring message exchange events among subsystems in the distributed system based on the abnormal information, and drawing a causal graph based on the message exchange events; constructing a diagnostic tree based on the causal graph; performing diagnostic tests on the subsystems in the distributed system to obtain estimation parameters; and calculating the alarm propagation path probabilities of the subsystems in the distributed system based on the diagnostic tree and the estimation parameters, so as to perform fault probability diagnosis on the distributed system. The method diagnoses faults occurring in the distributed system, identifies the source of the faults, and improves fault detection efficiency.

Description

Distributed system fault diagnosis method, device, equipment and storage medium
Technical Field
The present application relates to the field of fault diagnosis, and in particular, to a distributed system fault diagnosis method, apparatus, device, and storage medium.
Background
Currently, with the continued development of distributed computing infrastructure, distributed multi-tier applications underpin systems ranging from ATM networks to e-commerce web sites, and the consequences of distributed system downtime can be catastrophic.
However, due to the complex structure of a distributed system, error-reporting alarms propagate quickly, and many applications are black boxes with high latency from the diagnostician's perspective. A fault in one subsystem may therefore cause multiple monitoring indices to become abnormal and a large number of tests to fail, making it difficult to quickly locate the fault and determine its cause.
Therefore, how to diagnose faults occurring in a distributed system and identify their source has become an urgent problem to be solved.
Disclosure of Invention
The application provides a distributed system fault diagnosis method, apparatus, computer device and storage medium, which are used for performing fault diagnosis on faults occurring in a distributed system and identifying the source of the faults.
In a first aspect, the present application provides a distributed system fault diagnosis method, including:
receiving abnormal information sent by the distributed system;
acquiring message exchange events among subsystems in the distributed system based on the system abnormal information, and drawing a cause-and-effect graph based on the message exchange events;
constructing a diagnostic tree based on the causal graph;
performing diagnostic tests on subsystems in the distributed system to obtain estimated parameters;
and calculating the alarm propagation path probability of the subsystem in the distributed system based on the diagnosis tree and the estimation parameters so as to carry out fault probability diagnosis on the distributed system.
In a second aspect, the present application further provides a distributed system fault diagnosis apparatus, including:
the exception receiving module is used for receiving abnormal information sent by the distributed system;
the event acquisition module is used for acquiring message exchange events among subsystems in the distributed system based on the system abnormal information and drawing a cause-and-effect graph based on the message exchange events;
a diagnostic tree construction module for constructing a diagnostic tree based on the causal graph;
the diagnostic test module is used for carrying out diagnostic test on the subsystems in the distributed system to obtain estimation parameters;
and the probability diagnosis module is used for calculating the alarm propagation path probability of the subsystems in the distributed system based on the diagnosis tree and the estimation parameters so as to carry out fault probability diagnosis on the distributed system.
In a third aspect, the present application further provides a computer device comprising a memory and a processor; the memory is used for storing a computer program; the processor is used for executing the computer program and realizing the distributed system fault diagnosis method when the computer program is executed.
In a fourth aspect, the present application also provides a computer-readable storage medium storing a computer program, which when executed by a processor causes the processor to implement the distributed system fault diagnosis method as described above.
The application discloses a distributed system fault diagnosis method, device, equipment and storage medium. Abnormal information sent by the distributed system is received; message exchange events among subsystems in the distributed system are acquired based on the abnormal information, and a causal graph is drawn based on the message exchange events; a diagnostic tree is constructed based on the causal graph; diagnostic tests are performed on the subsystems in the distributed system to obtain estimation parameters; and the alarm propagation path probabilities of the subsystems are calculated based on the diagnostic tree and the estimation parameters, so as to perform fault probability diagnosis on the distributed system. Calculating the alarm propagation path probability of each subsystem from the constructed diagnostic tree makes it possible to diagnose faults occurring in the distributed system and improves the speed of fault diagnosis.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; other drawings can be obtained from them by those skilled in the art without creative effort.
Fig. 1 is a schematic flow chart of a distributed system fault diagnosis method provided in an embodiment of the present application;
FIG. 2 is a schematic diagram illustrating steps for creating a cause and effect graph based on the message exchange events according to an embodiment of the present application;
FIG. 3 is a causal diagram illustration of a subsystem in a distributed system as provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of a diagnostic tree of a subsystem in a distributed system according to an embodiment of the present disclosure;
FIG. 5 is a flow chart illustrating sub-steps of a distributed system fault diagnosis method provided by an embodiment of the present application;
FIG. 6 is a schematic diagram illustrating steps for calculating an alarm propagation path probability of a sub-system according to an embodiment of the present application;
fig. 7 is a schematic block diagram of a distributed system fault diagnosis apparatus according to an embodiment of the present application;
fig. 8 is a schematic block diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The flow diagrams depicted in the figures are merely illustrative and do not necessarily include all of the elements and operations/steps, nor do they necessarily have to be performed in the order depicted. For example, some operations/steps may be decomposed, combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
It is to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
The embodiment of the application provides a distributed system fault diagnosis method and device, computer equipment and a storage medium. The distributed system fault diagnosis method can be used for carrying out fault diagnosis on the distributed system, and the fault diagnosis speed is improved.
Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.
Referring to fig. 1, fig. 1 is a schematic flowchart of a distributed system fault diagnosis method according to an embodiment of the present application. The fault diagnosis method of the distributed system realizes fault probability detection in the distributed system by constructing a diagnosis tree and calculating the probability of a propagation path.
As shown in fig. 1, the distributed system fault diagnosis method specifically includes: step S101 to step S105.
S101, receiving abnormal information sent by the distributed system.
A monitor monitors the running state of the distributed system. When a subsystem or middleware of the distributed system fails, it throws an exception message; likewise, when an upstream system goes down, the downstream system cannot receive its messages and throws an exception message. The monitor receives the messages thrown by each subsystem; when an abnormal-message identifier is found in a message, the occurrence of a fault has been detected.
In particular, the monitor may obtain messages by modifying a middleware layer in the distributed system to forward them, or may capture protocol messages through its passive listening mechanism.
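As a rough illustration of this fault-detection step, the following minimal Python sketch flags messages carrying an abnormal-message identifier; the message layout and the "EXC" marker are assumptions made for illustration, not part of the disclosure.
```python
# Minimal sketch of the monitor's fault check (message layout and the
# "EXC" identifier are assumed for illustration).

EXCEPTION_MARKER = "EXC"  # assumed identifier carried by abnormal messages

def detect_faults(messages):
    """Return the messages that carry the abnormal-message identifier."""
    return [m for m in messages if m.get("type") == EXCEPTION_MARKER]

# Usage: diagnosis (steps S102-S105) starts when this list is non-empty.
faults = detect_faults([{"type": "EXC", "source": "subsystem-B"},
                        {"type": "OK", "source": "subsystem-A"}])
print(faults)  # [{'type': 'EXC', 'source': 'subsystem-B'}]
```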
S102, acquiring a message exchange event among subsystems in the distributed system based on the system abnormal information, and drawing a cause-and-effect graph based on the message exchange event.
Specifically, the monitor acquires message exchange events among the subsystems in the distributed system, and draws a cause-and-effect graph according to the acquired message exchange events.
In some embodiments, as shown in fig. 2, the drawing of the cause-and-effect graph based on the message exchange event specifically includes step S1021 and step S1022.
S1021, determining the occurrence sequence of the message exchange events.
Because the communication connections are asynchronous, the order in which the monitor receives message exchange events may differ from the order in which they actually occurred in the subsystems of the distributed system; the actual order of occurrence of the message exchange events therefore needs to be determined.
In some embodiments, the determining the order of occurrence of the message exchange events comprises: acquiring an event timestamp of the message exchange event; and determining the occurrence sequence of the message exchange events according to the event time stamps.
Specifically, for two different subsystems in the distributed system, the order in which the monitor receives their message exchange events may differ from the order in which the subsystems sent them. To recover the correct order of the message exchange events sent by multiple systems, and hence the correct causal relationships for drawing the causal graph, each subsystem attaches its own timestamp information when it sends a message, so that the monitor can determine the true order from the timestamp (i.e., the logical clock, LC) attached to each message.
For example, consider three subsystems A, B and D, where A sends a message to D first and B sends a message to D later. Because the messages from A and B reach the monitor via different paths, the times the monitor records on receiving the message exchange events may differ from the times at which A and B actually sent the messages, and the apparent sending order may change. Recovering the true order of the message exchange events between subsystems from the timestamps, and drawing the causal graph in that order, improves the accuracy of the resulting graph.
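The ordering step S1021 can be illustrated with the following Python sketch; the ExchangeEvent layout is an assumption, and ties between different senders are broken arbitrarily, which yields one admissible linearization.
```python
# Sketch of step S1021: restore the occurrence order of message exchange
# events from the logical-clock (LC) timestamps attached by the senders.

from typing import List, NamedTuple

class ExchangeEvent(NamedTuple):
    sender: str    # subsystem that sent the message, e.g. "B"
    receiver: str  # subsystem that received it, e.g. "D"
    lc: int        # sender's logical clock value at send time

def order_events(events: List[ExchangeEvent]) -> List[ExchangeEvent]:
    """Return events ordered consistently with each sender's logical clock.

    Events from the same sender are ordered by LC; ties across senders are
    broken by sender name, which is one admissible linearization.
    """
    return sorted(events, key=lambda e: (e.lc, e.sender))

# Mirroring the example above: B's LC2 message precedes B's LC4 message
# regardless of the order in which the monitor happened to receive them.
events = [ExchangeEvent("B", "D", 4), ExchangeEvent("A", "D", 1),
          ExchangeEvent("B", "D", 2)]
print(order_events(events))
```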
And S1022, sequentially connecting the subsystems in the distributed system according to the occurrence sequence to obtain a causal graph.
The nodes of the causal graph are subsystems in a distributed system, and the connecting edge between two nodes in the causal graph is a message exchange event between the two subsystems, namely message transmission between the two subsystems.
And sequentially connecting the subsystems in the distributed system according to the occurrence sequence of the message exchange events among the subsystems to finally obtain the cause-effect diagram.
For example, as shown in fig. 3, a causal graph among four subsystems A, B, C and D is drawn, where A, B, C and D are represented as four nodes in the graph. The numbers on the edges represent the order of the messages between subsystems in the distributed system. Message "6" carries the sender timestamp B.LC4, while message "2" carries the timestamp B.LC2, so message "2" must precede message "6".
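Step S1022 can then be sketched as follows, reusing the ExchangeEvent type from the previous sketch; the adjacency-list representation of the causal graph is an assumption.
```python
# Sketch of step S1022: connect subsystems in occurrence order to form the
# causal graph. Nodes are subsystems; each directed edge is one message
# exchange event, labelled with its position in the restored order.

def build_causal_graph(ordered_events):
    """Return {node: [(receiver, seq_no), ...]} from events ordered by S1021."""
    graph = {}
    for seq_no, event in enumerate(ordered_events, start=1):
        graph.setdefault(event.sender, []).append((event.receiver, seq_no))
        graph.setdefault(event.receiver, [])  # ensure every subsystem appears
    return graph

# E.g. A->D then B->D gives {'A': [('D', 1)], 'D': [], 'B': [('D', 2)]}.
```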
S103, constructing a diagnosis tree based on the causal graph.
The diagnostic tree is constructed according to the causal graph, specifically according to the message exchange events among the nodes in the causal graph and the timestamps corresponding to the message exchange events.
Specifically, a root node is determined based on the causal graph, where the root node is the subsystem that finally receives the message exchange event information sent by the other subsystems; the nodes that directly send messages to the root node become its child nodes; and recursion proceeds in this way until no message with a causal relationship remains in the causal graph, at which point the resulting tree is taken as the diagnostic tree. The same subsystem may appear multiple times at different depths in the diagnostic tree, since it may exchange messages with different subsystems at different times.
For example, taking the causal graph constructed in step S102, the root node of the diagnostic tree is D, and the nodes that directly send messages to node D are located at depth 1. Recursively, the nodes at depth i consist of all nodes that send messages to nodes at depth i-1; the resulting diagnostic tree is shown in FIG. 4.
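The unrolling of the causal graph into a diagnostic tree might be sketched as follows, again reusing the event representation from the earlier sketches; consuming each message (causal edge) exactly once is a simplifying assumption.
```python
# Sketch of step S103: unroll the causal graph into a diagnostic tree.
# The root is the subsystem that ultimately received the alarm; depth i
# holds every node that sent a message to a node at depth i-1. A subsystem
# can recur at several depths, so each message (causal edge) is consumed
# once, and the recursion stops when no causal edge remains.

def build_diagnostic_tree(ordered_events, root):
    """Return the tree as {(node, depth): [children at depth + 1]}."""
    remaining = list(ordered_events)  # causal edges not yet placed in the tree
    tree = {}
    frontier, depth = [root], 0
    while frontier and remaining:
        next_frontier = []
        for node in frontier:
            # Senders of still-unused messages into `node` become its children.
            children = [e.sender for e in remaining if e.receiver == node]
            remaining = [e for e in remaining if e.receiver != node]
            tree[(node, depth)] = children
            next_frontier.extend(children)
        frontier, depth = next_frontier, depth + 1
    return tree
```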
And S104, carrying out diagnostic test on the subsystems in the distributed system to obtain an estimation parameter.
The estimation parameters include a node reliability value, a connection reliability value, and an EMC (error masking capability) value. The EMC value is a preset value, and each subsystem corresponds to one EMC value. Specifically, a diagnostic test may be performed on each subsystem in the distributed system to obtain the estimation parameters corresponding to that subsystem. In a specific implementation, a causal test may be used to perform the diagnostic tests on the subsystems in the distributed system.
In some embodiments, the method may further comprise: and storing the estimation parameters corresponding to each subsystem into an estimation parameter table so as to facilitate data calling when the alarm propagation path probability is calculated, and improve the calculation speed and efficiency.
In some embodiments, as shown in fig. 5, step S104 specifically includes step S1041 and step S1042.
S1041, testing the subsystems in the distributed system by using the plurality of test messages to obtain a plurality of test results.
Specifically, each subsystem in the distributed system may be tested separately using multiple test messages, so as to obtain the estimation parameters of each subsystem. The test messages may be of different types, for example send, receive, or mixed, and may carry different content.
In a specific implementation, the format of a test message may be represented as <Type> <State1> <Event1> <Count1> <State2> <Event2> <Count2>, where Type represents the message type of the test message and a triple <State> <Event> <Count> means that event E is detected in state S at least C times. <State1> <Event1> <Count1> records the subsystem's actual message receiving or sending behaviour, while <State2> <Event2> <Count2> records the receiving or sending behaviour expected when the subsystem is abnormal.
For each test message, <State1> <Event1> <Count1> is compared with <State2> <Event2> <Count2> to obtain the corresponding test result, which is one of two conditions: the subsystem shows no abnormality, or the subsystem is abnormal.
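A minimal sketch of evaluating one test message in this format follows; the tuple representation, and treating a match with the abnormal pattern as a failed test, are assumptions made for illustration.
```python
# Sketch of step S1041's per-message check for test messages in the
# <Type><State1><Event1><Count1><State2><Event2><Count2> format.

from collections import namedtuple

TestMessage = namedtuple("TestMessage", "type s1 e1 c1 s2 e2 c2")

def no_abnormality(msg: "TestMessage") -> bool:
    """True if the subsystem showed no abnormality for this test message.

    (s1, e1, c1) is the subsystem's actual receive/send behaviour;
    (s2, e2, c2) is the behaviour expected when the subsystem is abnormal.
    Matching the abnormal pattern marks the test as failed.
    """
    return (msg.s1, msg.e1, msg.c1) != (msg.s2, msg.e2, msg.c2)
```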
S1042, counting the plurality of test results to calculate estimation parameters according to the statistical results.
Each test message corresponds to one test result. The test results are therefore counted to obtain the total number of test results (i.e., the total number of tests) and the number of test results in which the subsystem showed no abnormality, and the estimation parameters are then calculated from these statistics.
In a specific implementation process, the following formula can be adopted for calculating the node reliability value according to the statistical result:
nr = |R'| / |R|
wherein nr represents the node reliability value, |R| represents the total number of tests, and |R'| represents the number of test results in which the subsystem showed no abnormality. The larger the value of nr, the more reliable the node.
The following formula can be used for calculating the connection reliability value according to the statistical result:
lr_(i,j) = r_recv / r_sent
wherein lr_(i,j) represents the connection reliability value between node n_i and node n_j, r_recv represents the number of test messages received by the node receiving the messages, and r_sent represents the number of test messages sent by the node sending the messages. The larger the value of lr_(i,j), the more reliable the connection between node n_i and node n_j.
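Under the two formulas above, the calculation of step S1042 might be sketched as follows (representing each test result as a boolean is an assumption):
```python
# Sketch of step S1042: compute the estimation parameters from the
# statistics, using nr = |R'| / |R| and lr(i,j) = r_recv / r_sent above.

def node_reliability(results):
    """results: one boolean per test, True = no abnormality observed."""
    return sum(results) / len(results)  # |R'| / |R|

def connection_reliability(received: int, sent: int) -> float:
    """received: test messages seen by the receiver; sent: sent by the sender."""
    return received / sent  # r_recv / r_sent

print(node_reliability([True, True, False, True]))  # 0.75
print(connection_reliability(9, 10))                # 0.9
```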
And S105, calculating the alarm propagation path probability of the subsystem in the distributed system based on the diagnosis tree and the estimation parameters so as to carry out fault probability diagnosis on the distributed system.
The nodes in the diagnostic tree represent subsystems in the distributed system; that is, an alarm propagation path in the distributed system is a path from some node in the diagnostic tree to the root node. The alarm propagation path probability of each subsystem in the distributed system is calculated based on the diagnostic tree and the estimation parameters, so as to perform probability diagnosis on each subsystem.
In some embodiments, as shown in fig. 6, calculating the alarm propagation path probability of the subsystem specifically includes step S1051 and step S1052.
S1051, traversing each path connecting to the root node in the diagnostic tree to obtain a plurality of paths to be tested.
Specifically, the diagnostic tree is traversed to obtain the paths that connect to the root node of the diagnostic tree, and the obtained paths are taken as the paths to be tested.
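The traversal might be sketched as follows, assuming the {(node, depth): children} tree layout of the earlier tree-construction sketch:
```python
# Sketch of step S1051: enumerate every path in the diagnostic tree that
# connects a node to the root; each branch yields one path to be tested.

def paths_to_root(tree, root):
    """Return candidate paths as [n_1, ..., n_k] with n_k the root."""
    paths = []

    def walk(node, depth, suffix):
        children = tree.get((node, depth), [])
        if not children:                      # leaf: emit the completed path
            paths.append(list(reversed(suffix)))
        for child in children:
            walk(child, depth + 1, suffix + [child])

    walk(root, 0, [root])
    return paths
```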
S1052, obtaining the estimation parameters corresponding to each node in each path to be tested, and calculating the alarm propagation path probability corresponding to each path to be tested according to a preset calculation formula.
Specifically, the preset calculation formula may specifically be:
PPEP(n_1, n_k) = (1 - nr(n_1)) · lr_(1,2) · (1 - emc(n_2)) · … · (1 - nr(n_i)) · lr_(i,i+1) · (1 - emc(n_(i+1))) · … · (1 - nr(n_(k-2))) · lr_(k-2,k-1) · (1 - emc(n_(k-1))) · lr_(k-1,k)
wherein PPEP(n_1, n_k) represents the probability that node n_1 fails, the failure alarm propagates along the path from n_1 to n_k, and n_k consequently also fails; nr(n_i) represents the node reliability value of node n_i; lr_(i,j) represents the connection reliability value between node n_i and node n_j; and emc(n_i) represents the node EMC value of node n_i.
For each path to be tested, the estimation parameters of each node in the path are obtained, and the alarm propagation path probability of the path is calculated using the preset calculation formula. The paths with a higher probability of having caused the failure are screened out by their alarm propagation path probabilities, and fault detection is then performed on the nodes in the screened paths, which speeds up fault diagnosis.
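Putting the preset formula to work, the following sketch scores one candidate path with PPEP; the dictionary layout of the estimation-parameter tables is an assumption.
```python
# Sketch of step S1052: score a path [n_1, ..., n_k] (k >= 2) with the PPEP
# formula above. nr and emc map node -> value; lr maps (node_i, node_j) -> value.

def ppep(path, nr, lr, emc):
    k = len(path)
    p = 1.0
    for i in range(k - 2):  # product terms for n_1 .. n_{k-2}
        a, b = path[i], path[i + 1]
        p *= (1.0 - nr[a]) * lr[(a, b)] * (1.0 - emc[b])
    p *= lr[(path[-2], path[-1])]  # final hop into the root n_k
    return p

# Paths can then be ranked by PPEP and the highest-probability ones
# inspected first, as described above.
```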
In the distributed system fault diagnosis method provided by the above embodiment, abnormal information sent by the distributed system is received; message exchange events among subsystems in the distributed system are acquired based on the abnormal information, and a causal graph is drawn based on the message exchange events; a diagnostic tree is constructed based on the causal graph; diagnostic tests are performed on the subsystems to obtain estimation parameters; and the alarm propagation path probabilities of the subsystems are calculated based on the diagnostic tree and the estimation parameters, so as to perform fault probability diagnosis on the distributed system. Calculating the alarm propagation path probability of each subsystem from the constructed diagnostic tree enables fault diagnosis of faults occurring in the distributed system and improves the speed of fault diagnosis.
Referring to fig. 7, fig. 7 is a schematic block diagram of a distributed system fault diagnosis apparatus according to an embodiment of the present application, where the distributed system fault diagnosis apparatus is configured to perform the foregoing distributed system fault diagnosis method. The distributed system fault diagnosis device may be configured in a server or a terminal.
The server may be an independent server or a server cluster. The terminal can be an electronic device such as a mobile phone, a tablet computer, a notebook computer, a desktop computer, a personal digital assistant and a wearable device.
As shown in fig. 7, the distributed system fault diagnosis apparatus 200 includes: an exception receiving module 201, an event obtaining module 202, a diagnostic tree construction module 203, a diagnostic test module 204 and a probability diagnosis module 205.
An exception receiving module 201, configured to receive exception information sent by the distributed system.
An event obtaining module 202, configured to obtain, based on the system anomaly information, a message exchange event between subsystems in the distributed system, and draw a cause-and-effect graph based on the message exchange event.
The event obtaining module 202 includes an order determining sub-module 2021 and a causal graph sub-module 2022.
In particular, the order determination sub-module 2021 is configured to determine an occurrence order of the message exchange events. The cause and effect diagram sub-module 2022 is configured to sequentially connect the subsystems in the distributed system according to the occurrence order to obtain a cause and effect diagram.
A diagnostic tree construction module 203 for constructing a diagnostic tree based on the causal graph.
And the diagnostic test module 204 is configured to perform diagnostic tests on subsystems in the distributed system to obtain the estimated parameters.
The diagnostic test module 204 includes a test result sub-module 2041 and a result statistic sub-module 2042.
Specifically, the test result sub-module 2041 is configured to test the subsystems in the distributed system by using multiple test messages, so as to obtain multiple test results. The result statistic submodule 2042 is configured to perform statistics on the multiple test results, so as to calculate an estimation parameter according to the statistical result.
And a probability diagnosis module 205, configured to calculate an alarm propagation path probability of the subsystem in the distributed system based on the diagnosis tree and the estimation parameter, so as to perform fault probability diagnosis on the distributed system.
The probabilistic diagnostic module 205 includes a path traversal sub-module 2051 and a probability computation sub-module 2052.
Specifically, the path traversing sub-module 2051 is configured to traverse each path connecting the root nodes in the diagnostic tree to obtain multiple paths to be measured. And the probability calculation submodule 2052 is configured to obtain an estimation parameter corresponding to each node in the path to be measured, and calculate the probability of the alarm propagation path corresponding to the path to be measured according to a preset calculation formula.
It should be noted that, as will be clearly understood by those skilled in the art, for convenience and brevity of description, the specific working processes of the distributed system fault diagnosis apparatus and each module described above may refer to the corresponding processes in the foregoing distributed system fault diagnosis method embodiment, and are not described herein again.
The distributed system fault diagnosis apparatus described above may be implemented in the form of a computer program that can be run on a computer device as shown in fig. 8.
Referring to fig. 8, fig. 8 is a schematic block diagram of a computer device according to an embodiment of the present disclosure. The computer device may be a server or a terminal.
Referring to fig. 8, the computer device includes a processor, a memory, and a network interface connected through a system bus, wherein the memory may include a nonvolatile storage medium and an internal memory.
The non-volatile storage medium may store an operating system and a computer program. The computer program includes program instructions that, when executed, cause a processor to perform any one of the distributed system fault diagnosis methods.
The processor is used for providing calculation and control capability and supporting the operation of the whole computer equipment.
The internal memory provides an environment for the execution of a computer program on a non-volatile storage medium, which when executed by a processor, causes the processor to perform any one of the distributed system fault diagnosis methods.
The network interface is used for network communication, such as sending assigned tasks and the like. Those skilled in the art will appreciate that the architecture shown in fig. 8 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
It should be understood that the Processor may be a Central Processing Unit (CPU), and the Processor may be other general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, etc. Wherein a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
Wherein, in one embodiment, the processor is configured to execute a computer program stored in the memory to implement the steps of:
receiving abnormal information sent by the distributed system; acquiring message exchange events among subsystems in the distributed system based on the system abnormal information, and drawing a cause-and-effect graph based on the message exchange events; constructing a diagnostic tree based on the causal graph; performing diagnostic tests on subsystems in the distributed system to obtain estimated parameters; and calculating the alarm propagation path probability of the subsystem in the distributed system based on the diagnosis tree and the estimation parameters so as to carry out fault probability diagnosis on the distributed system.
In one embodiment, the processor, in implementing the rendering of a cause and effect graph based on the message exchange events, is configured to implement:
determining an order of occurrence of the message exchange events; and sequentially connecting the subsystems in the distributed system according to the occurrence sequence to obtain a cause-and-effect graph, wherein the nodes of the cause-and-effect graph are the subsystems in the distributed system, and the connecting edge between two nodes in the cause-and-effect graph is a message exchange event between the two subsystems.
In one embodiment, the processor, when implementing the determining the order of occurrence of the message exchange events, is configured to implement:
acquiring an event timestamp of the message exchange event; and determining the occurrence sequence of the message exchange events according to the event time stamps.
In one embodiment, the processor, in performing the diagnostic test on the subsystems in the distributed system to obtain the estimated parameters, is configured to perform:
testing subsystems in the distributed system by using a plurality of test messages to obtain a plurality of test results; and counting a plurality of test results to calculate an estimation parameter according to the statistical result.
In one embodiment, the processor, in implementing the calculating of the alarm propagation path probability for the subsystem in the distributed system based on the diagnostic tree and the estimated parameters, is configured to implement:
traversing each path connecting to the root node in the diagnostic tree to obtain a plurality of paths to be tested; and acquiring the estimation parameters corresponding to each node in each path to be tested, and calculating the alarm propagation path probability corresponding to each path to be tested according to a preset calculation formula.
In one embodiment, the estimated parameters include a node reliability value, a connection reliability value, and an EMC value, and the preset calculation formula is:
PPEP(n_1, n_k) = (1 - nr(n_1)) · lr_(1,2) · (1 - emc(n_2)) · … · (1 - nr(n_i)) · lr_(i,i+1) · (1 - emc(n_(i+1))) · … · (1 - nr(n_(k-2))) · lr_(k-2,k-1) · (1 - emc(n_(k-1))) · lr_(k-1,k)
wherein PPEP(n_1, n_k) represents the alarm propagation path probability, nr(n_i) represents the node reliability value of node n_i, lr_(i,j) represents the connection reliability value between node n_i and node n_j, and emc(n_i) represents the node EMC value of node n_i.
The embodiment of the application further provides a computer-readable storage medium, wherein a computer program is stored in the computer-readable storage medium, the computer program comprises program instructions, and the processor executes the program instructions to implement any one of the distributed system fault diagnosis methods provided by the embodiment of the application.
The computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiment, for example, a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the computer device.
While the invention has been described with reference to specific embodiments, the scope of the invention is not limited thereto, and those skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A distributed system fault diagnosis method is characterized by comprising the following steps:
receiving abnormal information sent by the distributed system;
acquiring message exchange events among subsystems in the distributed system based on the system abnormal information, and drawing a cause-and-effect graph based on the message exchange events;
constructing a diagnostic tree based on the causal graph;
performing diagnostic tests on subsystems in the distributed system to obtain estimated parameters;
and calculating the alarm propagation path probability of the subsystem in the distributed system based on the diagnosis tree and the estimation parameters so as to carry out fault probability diagnosis on the distributed system.
2. The distributed system fault diagnosis method of claim 1, wherein said plotting a cause and effect graph based on said message exchange events comprises:
determining an order of occurrence of the message exchange events;
and sequentially connecting the subsystems in the distributed system according to the occurrence sequence to obtain a cause-and-effect graph, wherein the nodes of the cause-and-effect graph are the subsystems in the distributed system, and the connecting edge between two nodes in the cause-and-effect graph is a message exchange event between the two subsystems.
3. The distributed system fault diagnosis method according to claim 2, wherein the determining the occurrence order of the message exchange events comprises:
acquiring an event timestamp of the message exchange event;
and determining the occurrence sequence of the message exchange events according to the event time stamps.
4. The method of claim 1, wherein the performing diagnostic tests on subsystems in the distributed system to obtain estimated parameters comprises:
testing subsystems in the distributed system by using a plurality of test messages to obtain a plurality of test results;
and counting a plurality of test results to calculate an estimation parameter according to the statistical result.
5. The distributed system fault diagnosis method of claim 1, wherein the calculating of the alarm propagation path probability for the sub-system in the distributed system based on the diagnostic tree and the estimated parameters comprises:
traversing each path connecting to the root node in the diagnostic tree to obtain a plurality of paths to be tested;
and acquiring the estimation parameters corresponding to each node in each path to be tested, and calculating the alarm propagation path probability corresponding to each path to be tested according to a preset calculation formula.
6. The distributed system fault diagnosis method according to claim 5, wherein the estimated parameters include node reliability values, connection reliability values and EMC values, and the preset calculation formula is:
PPEP(n_1, n_k) = (1 - nr(n_1)) · lr_(1,2) · (1 - emc(n_2)) · … · (1 - nr(n_i)) · lr_(i,i+1) · (1 - emc(n_(i+1))) · … · (1 - nr(n_(k-2))) · lr_(k-2,k-1) · (1 - emc(n_(k-1))) · lr_(k-1,k)
wherein PPEP(n_1, n_k) represents the alarm propagation path probability, nr(n_i) represents the node reliability value of node n_i, lr_(i,j) represents the connection reliability value between node n_i and node n_j, and emc(n_i) represents the node EMC value of node n_i.
7. The distributed system fault diagnosis method according to claim 1, wherein the method further comprises:
and storing the estimation parameters corresponding to each subsystem into an estimation parameter table.
8. A distributed system fault diagnosis apparatus, comprising:
the exception receiving module is used for receiving abnormal information sent by the distributed system;
the event acquisition module is used for acquiring message exchange events among subsystems in the distributed system based on the system abnormal information and drawing a cause-and-effect graph based on the message exchange events;
a diagnostic tree construction module for constructing a diagnostic tree based on the causal graph;
the diagnostic test module is used for carrying out diagnostic test on the subsystems in the distributed system to obtain estimation parameters;
and the probability diagnosis module is used for calculating the alarm propagation path probability of the subsystems in the distributed system based on the diagnosis tree and the estimation parameters so as to carry out fault probability diagnosis on the distributed system.
9. A computer device, wherein the computer device comprises a memory and a processor;
the memory is used for storing a computer program;
the processor for executing the computer program and implementing the distributed system fault diagnosis method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to implement the distributed system fault diagnosis method according to any one of claims 1 to 7.
CN202010575458.5A 2020-06-22 2020-06-22 Distributed system fault diagnosis method, device, equipment and storage medium Pending CN111796956A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010575458.5A CN111796956A (en) 2020-06-22 2020-06-22 Distributed system fault diagnosis method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010575458.5A CN111796956A (en) 2020-06-22 2020-06-22 Distributed system fault diagnosis method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111796956A true CN111796956A (en) 2020-10-20

Family

ID=72803635

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010575458.5A Pending CN111796956A (en) 2020-06-22 2020-06-22 Distributed system fault diagnosis method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111796956A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113780614A (en) * 2021-01-04 2021-12-10 北京沃东天骏信息技术有限公司 Risk identification method and device
CN112365344A (en) * 2021-01-11 2021-02-12 支付宝(杭州)信息技术有限公司 Method and system for automatically generating business rules
CN112365344B (en) * 2021-01-11 2021-04-27 支付宝(杭州)信息技术有限公司 Method and system for automatically generating business rules
CN114200899A (en) * 2021-11-16 2022-03-18 中国航空工业集团公司雷华电子技术研究所 Multi-subsystem control method and system electronic equipment and readable storage medium thereof
CN114579380A (en) * 2022-03-02 2022-06-03 淮北仕欧网络科技有限公司 Artificial intelligence detection system and method for computer system fault
CN114579380B (en) * 2022-03-02 2023-10-17 浙江中国小商品城集团股份有限公司 Artificial intelligence detection system and method for computer system faults
CN116048859A (en) * 2023-01-28 2023-05-02 金篆信科有限责任公司 Distributed database fault diagnosis method and device, electronic equipment and storage medium
CN116048859B (en) * 2023-01-28 2023-08-25 金篆信科有限责任公司 Distributed database fault diagnosis method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111796956A (en) Distributed system fault diagnosis method, device, equipment and storage medium
US11321160B2 (en) In a microservices-based application, mapping distributed error stacks across multiple dimensions
US11010235B1 (en) Tracking error propagation across microservices based applications using distributed error stacks
US11755446B1 (en) Application topology graph for representing uninstrumented objects in a microservices-based architecture
US8230262B2 (en) Method and apparatus for dealing with accumulative behavior of some system observations in a time series for Bayesian inference with a static Bayesian network model
US8156377B2 (en) Method and apparatus for determining ranked causal paths for faults in a complex multi-host system with probabilistic inference in a time series
US8069370B1 (en) Fault identification of multi-host complex systems with timesliding window analysis in a time series
US8291263B2 (en) Methods and apparatus for cross-host diagnosis of complex multi-host systems in a time series with probabilistic inference
US7711987B2 (en) System and method for problem determination using dependency graphs and run-time behavior models
US8667334B2 (en) Problem isolation in a virtual environment
US11526425B1 (en) Generating metric data streams from spans ingested by a cloud deployment of an instrumentation analytics engine
US11429574B2 (en) Computer system diagnostic log chain
US20150046757A1 (en) Performance Metrics of a Computer System
JP7081741B2 (en) Methods and devices for determining the status of network devices
Kobayashi et al. Mining causes of network events in log data with causal inference
CN115118621B (en) Dependency graph-based micro-service performance diagnosis method and system
US11516269B1 (en) Application performance monitoring (APM) detectors for flagging application performance alerts
US9348721B2 (en) Diagnosing entities associated with software components
US8972789B2 (en) Diagnostic systems for distributed network
US11789804B1 (en) Identifying the root cause of failure observed in connection to a workflow
CN113454950A (en) Network equipment and link real-time fault detection method and system based on flow statistics
Kavulya et al. Draco: Top Down Statistical Diagnosis of Large-Scale VoIP Networks
CN116057902A (en) Health index of service
US9311210B1 (en) Methods and apparatus for fault detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination