CN109474445B

CN109474445B - Distributed system root fault positioning method and device

Info

Publication number: CN109474445B
Application number: CN201710801677.9A
Authority: CN
Inventors: 赵丽; 王泽�; 郭三川; 柳哲; 何慧虹; 徐太忠; 潘欣雨
Original assignee: National Computer Network and Information Security Management Center
Current assignee: National Computer Network and Information Security Management Center
Priority date: 2017-09-07
Filing date: 2017-09-07
Publication date: 2022-08-19
Anticipated expiration: 2037-09-07
Also published as: CN109474445A

Abstract

The invention relates to a method and a device for locating root cause faults of a distributed system. The method comprises the following steps: acquiring test data of a distributed system, and selecting the test data in which the tested service is unavailable; constructing a fault diagnosis graph from the selected test data, and acquiring the complete subgraphs in the fault diagnosis graph; and performing fault location according to the complete subgraphs. The technical scheme provided by the invention constructs the fault diagnosis graph from test data, diagnoses faults of the distributed system quickly and accurately according to the fault diagnosis graph, and helps prevent the trouble and economic loss caused by system failure.

Description

Distributed system root fault positioning method and device

Technical Field

The invention relates to the technical field of data mining, in particular to a method and a device for locating root cause faults of a distributed system.

Background

With the rapid development of internet technologies such as cloud computing and big data, the distributed information systems of large enterprises keep growing in scale. For example, by 2014 Amazon had built 11 cloud regions and 28 data center groups with 2 million servers, and by 2014 Google owned 1 million servers in its global data centers. When business systems that are closely tied to economic production and social life break down or suffer network attacks, they cause great inconvenience and economic loss to society, and may even lead to serious social security incidents. For example, on May 27, 2015, Alipay, which has nearly 300 million active users, suffered a large-scale outage; on May 28, 2015, Ctrip was inaccessible for 12 hours, with an economic loss of about 50 million yuan RMB. Therefore, in large-scale information systems, fast and efficient automatic fault detection and diagnosis is important in both research and practice.

The existing methods for analyzing the root cause fault of the distributed system comprise a graph-based method, an expert system-based method, an analysis model and a data driving method.

The fault diagnosis method based on graph theory mainly includes a Signed Directed Graph (SDG) method and a fault tree method. The SDG is a graphical model for describing the causal relationship of the system, nodes represent events or variables, and directed edges represent the causal relationship among the variables. And when a fault occurs, judging the cause of the fault by combining a certain search strategy according to the cause-and-effect relationship among the node changes. The fault tree is a logical diagram from effect to cause, and based on the fault state of the system, the fault tree performs inference step by step to determine the basic cause, the influence degree and the occurrence probability of the fault.

The fault diagnosis method based on the expert system utilizes the practical experience of field experts to establish a knowledge base and carries out reasoning and decision process to carry out fault diagnosis. Expert knowledge is often expressed in deterministic IF-THEN rules. Because the expert knowledge has uncertainty, the fuzzy expert system introduces the concept of fuzzy membership and utilizes fuzzy logic to carry out reasoning, so that the uncertainty in the expert knowledge can be well processed.

The fault diagnosis method based on the analytical model describes the expected behavior of the system with an accurate mathematical model, and diagnoses faults by checking whether the actual inputs and outputs are consistent with that model. It mainly comprises state estimation methods, parameter estimation methods and parity space methods. State estimation methods mainly comprise filter methods and observer methods, and diagnose faults with techniques such as hypothesis testing. Parameter estimation methods assume that a fault changes the parameters of the model, and diagnose faults by detecting those parameter changes.

The data-driven fault diagnosis method carries out fault diagnosis by analyzing and processing system operation data without knowing a system accurate analysis model, and is mainly divided into a machine learning method, a multivariate statistical analysis method and the like. The machine learning method is used for fault diagnosis by training a neural network or a support vector machine and other machine learning algorithms by utilizing historical data of the system under normal and various fault conditions. The multivariate statistical analysis-based method is used for fault diagnosis by utilizing the correlation among a plurality of variables, and mainly comprises a PCA (principal component analysis) method. The PCA method projects a sample matrix of the variables to a low-dimensional space, reflects the main changes of the process variables, and carries out fault diagnosis through given detection indexes.

The existing fault diagnosis methods for distributed information systems are still severely limited, mainly for two reasons. On one hand, existing methods for diagnosing availability test failures in distributed systems are not intuitive enough: as distributed information systems grow more complex and fault points become hard to detect comprehensively, faults are hard to diagnose accurately, and more intuitive analysis and diagnosis tools are needed. On the other hand, existing methods struggle with real-time unsupervised fault diagnosis. The graph theory, expert system and data-driven methods require knowledge of the system structure or training data, and analytical-model methods have difficulty modeling a complex system; such data or knowledge is hard to obtain in practice. Moreover, none of these methods uses the information from system service availability tests, which is of great value for effective fault diagnosis.

Disclosure of Invention

The invention provides a method and a device for locating root cause faults of a distributed system. It aims to construct a fault diagnosis graph from test data, to diagnose faults of the distributed system quickly and accurately according to the fault diagnosis graph, and thereby to help prevent the trouble and economic loss caused by system failure.

The purpose of the invention is realized by adopting the following technical scheme:

the improvement of a distributed system root cause fault positioning method is that the method comprises the following steps:

acquiring test data of a distributed system, and selecting the test data in which the tested service is unavailable;

constructing a fault diagnosis graph from the test data in which the tested service is unavailable, and acquiring the complete subgraphs in the fault diagnosis graph;

and performing fault location according to the complete subgraphs.

Preferably, the distributed system test data includes test condition attributes and their corresponding test result attribute. The test condition attributes include external attributes and their corresponding internal attributes. The external attributes include a test address, an operator and a tested service. The internal attributes corresponding to the external attributes include a network device state, an operating system state and an application program state, and are obtained from the external attributes by a data flow tracking method.

Further, the selecting test data, in the test data, for which the service under test is unavailable includes:

and if the attribute value of the test result attribute corresponding to the test condition attribute of the test data is unavailable, selecting the test data.

Preferably, the constructing a fault diagnosis graph by using the test data of the test data, in which the service to be tested is unavailable, and acquiring a complete sub-graph in the fault diagnosis graph includes:

taking the test data in which the tested service is unavailable as nodes, and, if two nodes have the same attribute value for an attribute, connecting the two nodes for that attribute value, to obtain a fault diagnosis graph;

and extracting, for each single attribute value of a single attribute, the connected graph of the nodes sharing that value in the fault diagnosis graph as a complete subgraph.
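The construction described above can be sketched as a small multi-relation graph in which every edge is labelled with the attribute value the two nodes share. This is only an illustrative sketch: the function names, the attribute names and the three sample nodes are hypothetical and do not come from the patent.

```python
from itertools import combinations

# Hypothetical failing test records: node id -> {attribute: value}.
nodes = {
    1: {"operator": "Unicom", "server": "S11"},
    2: {"operator": "Unicom", "server": "S11"},
    3: {"operator": "Unicom", "server": "S12"},
}


def build_fault_graph(nodes):
    """Return labelled edges (u, v, attribute, value) of the fault diagnosis graph."""
    edges = []
    for u, v in combinations(sorted(nodes), 2):
        for attr, value in nodes[u].items():
            # Two nodes are connected once per attribute on which they agree.
            if nodes[v].get(attr) == value:
                edges.append((u, v, attr, value))
    return edges


def complete_subgraph(edges, attr, value):
    """Nodes joined by edges carrying one attribute value (pairwise connected)."""
    members = set()
    for u, v, a, val in edges:
        if (a, val) == (attr, value):
            members.update((u, v))
    return members


edges = build_fault_graph(nodes)
print(sorted(complete_subgraph(edges, "operator", "Unicom")))  # [1, 2, 3]
print(sorted(complete_subgraph(edges, "server", "S11")))       # [1, 2]
```

Because every pair of nodes sharing a value gets an edge for that value, the nodes collected for one attribute value are always pairwise connected. A value held by a single node produces no edges in this sketch and would have to be treated separately as a one-node subgraph.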

Preferably, the performing fault location according to the complete subgraph includes:

verifying the reliability of the complete subgraph and acquiring the complete subgraph meeting the reliability requirement;

and carrying out fault positioning according to the number of the nodes of the complete subgraph.

Further, the verifying the reliability of the complete sub-graph and obtaining the complete sub-graph meeting the reliability requirement includes:

performing an F test on the kth complete subgraph, and determining the test value $F_k$ of the kth complete subgraph according to the following formula:

$$F_k = \frac{SSA_k / f_{SSA_k}}{SSE_k / f_{SSE_k}}$$

In the above formula, $SSA_k$ is the between-group sum of squares of the kth complete subgraph, $f_{SSA_k}$ is the degree of freedom of $SSA_k$, $SSE_k$ is the within-group sum of squares of the kth complete subgraph, and $f_{SSE_k}$ is the degree of freedom of $SSE_k$;

if the test value $F_k$ of the kth complete subgraph is greater than the F-test threshold, the kth complete subgraph meets the reliability requirement.

Further, the between-group sum of squares $SSA_k$ of the kth complete subgraph is determined as follows:

$$SSA_k = n_k\,(\bar{x}_{kM} - \bar{x}_M)^2 + n'_k\,(\bar{x}'_{kM} - \bar{x}_M)^2$$

and the within-group sum of squares $SSE_k$ of the kth complete subgraph is determined as follows:

$$SSE_k = \sum_{i=1}^{n_k} (x_{kMi} - \bar{x}_{kM})^2 + \sum_{i=1}^{n'_k} (x'_{kMi} - \bar{x}'_{kM})^2$$

In the above formulas, $n_k$ is the number of test data whose attribute value is the same as that of the attribute corresponding to the kth complete subgraph, $\bar{x}_{kM}$ is the proportion of those test data whose test result is unavailable, $\bar{x}_M$ is the proportion of all test data whose test result is unavailable, $n'_k$ is the number of test data whose attribute value differs from that of the attribute corresponding to the kth complete subgraph, $\bar{x}'_{kM}$ is the proportion of those test data whose test result is unavailable, $x_{kMi}$ is the test result coefficient of the ith test data whose attribute value is the same as that of the attribute corresponding to the kth complete subgraph, and $x'_{kMi}$ is the test result coefficient of the ith test data whose attribute value differs from it. The test result coefficient of a piece of test data is 0 when its test result is available and 1 when its test result is unavailable.

Further, the performing fault location according to the number of nodes in the complete subgraph includes:

acquiring the complete subgraph $C_1$ with the largest number of nodes among the complete subgraphs meeting the reliability requirement, and adding the attribute value of the attribute corresponding to $C_1$ into a fault set;

comparing the nodes of the xth complete subgraph $C_x$ among the complete subgraphs meeting the reliability requirement with the nodes of $C_1$, and, if the coincidence rate of the nodes is greater than the correlation threshold, adding the attribute value of the attribute corresponding to $C_x$ into the fault set, wherein $x \in [1, X]$ and $X$ is the total number of complete subgraphs meeting the reliability requirement;

and taking the attribute value in the fault set as a fault reason.

In a distributed system root cause fault location apparatus, the improvement comprising:

an acquisition unit, configured to acquire test data of a distributed system and to select the test data in which the tested service is unavailable;

the building unit is used for building a fault diagnosis graph by using test data of which the tested service is unavailable in the test data and acquiring a complete subgraph in the fault diagnosis graph;

and the fault positioning unit is used for carrying out fault positioning according to the complete subgraph.

The invention has the beneficial effects that:

The technical scheme provided by the invention constructs a fault diagnosis graph from the test data in which the tested service is unavailable, acquires the complete subgraphs in the fault diagnosis graph, and finally performs fault location according to the complete subgraphs. The test samples are described and analyzed with a multi-relation graph, which visually represents the associations among the failure records in the current system and, compared with an ordinary weighted graph, better reflects the direct and exact associations between failed nodes during fault diagnosis. By performing fault location on the associated clusters, the attributes and attribute values related to the failure are clustered at the same time as the failure records. In many cases the attribute values clearly reflect the root cause of the fault, and in the remaining cases the clustering result roughly locates the root cause;

furthermore, in the fault location process, the reliability of each complete subgraph is judged with an F test. This filters out the large complete subgraphs that form easily merely because an attribute takes few distinct values, so that the complete subgraphs genuinely related to the fault cause can be screened out.

Drawings

FIG. 1 is a flow chart of a distributed system root cause fault location method of the present invention;

fig. 2 is a schematic structural diagram of the root cause fault locating device of the distributed system of the invention.

Detailed Description

The following detailed description of the embodiments of the invention refers to the accompanying drawings.

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The invention provides a distributed system root fault positioning method, as shown in fig. 1, including:

101. acquiring test data of a distributed system, and selecting the test data in which the tested service is unavailable;

102. constructing a fault diagnosis graph by using test data of which the tested service is unavailable in the test data, and acquiring a complete subgraph in the fault diagnosis graph;

103. and carrying out fault positioning according to the complete subgraph.

The distributed system test data includes test condition attributes and their corresponding test result attribute. The test condition attributes include external attributes and their corresponding internal attributes. The external attributes include a test address, an operator and a tested service. The internal attributes corresponding to the external attributes include a network device state, an operating system state and an application program state, and are obtained from the external attributes by a data flow tracking method.

The test result attribute is the result of a simulated use of the tested service of the distributed system: if the tested service is available, the attribute value of the test result attribute is available, and if the tested service is unavailable, the attribute value of the test result attribute is unavailable.

Further, in step 101, selecting the test data in which the tested service is unavailable includes:

if the attribute value of the test result attribute corresponding to the test condition attributes of a piece of test data is unavailable, selecting that test data.

For example: test data of a distributed system is acquired, comprising test condition attributes and the corresponding test result attribute. The test condition attributes comprise external test data and the internal monitoring data associated with it. The attributes of the external test data comprise a test address, an operator and a tested service; the internal monitoring data comprises a network device state, an operating system state and an application program state. Let $x_i = [x_{i1}, x_{i2}, \ldots, x_{iM}]$ represent a complete availability test record, where $x_{ik}$ is the kth attribute value of the ith test record and $M$ is the number of attributes. The data set of all test records is $X = \{x_i \mid i = 1, \ldots, |X|\}$, where $|X|$ is the number of records in $X$. An example test data set is shown in Table 1:

table 1 test data set example

The data in Table 1 represents 16 original test records, that is, $|X| = 16$. Each record has 7 test condition attributes (test place, operator, tested service, system region, network device, server and CPU occupancy) and 1 test result attribute whose value is 0 or 1: the value is 0 when the tested service is available and 1 when the tested service is unavailable.
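The record layout and the selection in step 101 can be sketched as follows. This is a minimal illustration rather than the patented implementation; the class name, the field names and the sample values are hypothetical, since the contents of Table 1 are not reproduced in this text.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Record:
    """One availability test record: 7 condition attributes plus the result."""
    place: str
    operator: str
    service: str
    region: str
    device: str
    server: str
    cpu: str
    result: int  # 0 = tested service available, 1 = tested service unavailable


def select_unavailable(records: List[Record]) -> List[Record]:
    """Step 101: keep only the records whose test result is 'unavailable'."""
    return [r for r in records if r.result == 1]


# Hypothetical sample values, for illustration only.
records = [
    Record("P1", "Unicom", "X", "L1", "N11", "S11", "low", 1),
    Record("P2", "Telecom", "X", "L2", "N21", "S21", "low", 0),
    Record("P3", "Unicom", "Y", "L1", "N11", "S12", "high", 1),
]
failed = select_unavailable(records)
print(len(failed))  # 2
```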

After the test data in which the tested service is unavailable is obtained, a fault diagnosis graph is constructed from it and the complete subgraphs in the fault diagnosis graph are acquired. Step 102 therefore includes:

taking the test data in which the tested service is unavailable as nodes, and, if two nodes have the same attribute value for an attribute, connecting the two nodes for that attribute value, to obtain a fault diagnosis graph;

and extracting, for each single attribute value of a single attribute, the connected graph of the nodes sharing that value in the fault diagnosis graph as a complete subgraph.

For example, as shown in Table 1, the test data in which the tested service is unavailable are the 1st, 3rd, 4th and 5th test data, which are taken as 4 nodes. Among these nodes, the first attribute takes 4 values, each forming a complete subgraph with 1 node; the second attribute takes 1 value, forming a complete subgraph with 4 nodes; the third attribute takes 2 values, forming one complete subgraph with 3 nodes and one with 1 node; the fourth attribute takes 1 value, forming a complete subgraph with 4 nodes; the fifth attribute takes 1 value, forming a complete subgraph with 4 nodes; the sixth attribute takes 2 values, forming one complete subgraph with 3 nodes and one with 1 node; and the seventh attribute takes 3 values, forming one complete subgraph with 2 nodes and two with 1 node. Thus the set of failing records in this example has 14 complete subgraphs in total.
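The count in this example can be checked by grouping the failing nodes by (attribute, value). The sketch below uses hypothetical attribute values chosen only to match the value counts stated above (Table 1 itself is not available here), and the function name `complete_subgraphs` is an assumption.

```python
from collections import defaultdict

ATTRIBUTES = ("place", "operator", "service", "region", "device", "server", "cpu")

# Hypothetical values for the 1st, 3rd, 4th and 5th records; only the number
# of distinct values per attribute follows the description.
FAILED = {
    1: ("P1", "Unicom", "X", "L1", "N11", "S11", "low"),
    3: ("P2", "Unicom", "X", "L1", "N11", "S11", "low"),
    4: ("P3", "Unicom", "X", "L1", "N11", "S12", "medium"),
    5: ("P4", "Unicom", "Y", "L1", "N11", "S11", "high"),
}


def complete_subgraphs(failed):
    """Map each (attribute, value) to the set of failing nodes that share it."""
    groups = defaultdict(set)
    for node, values in failed.items():
        for attr, value in zip(ATTRIBUTES, values):
            groups[(attr, value)].add(node)
    return dict(groups)


subgraphs = complete_subgraphs(FAILED)
sizes = sorted((len(members) for members in subgraphs.values()), reverse=True)
print(len(subgraphs))  # 14
print(sizes)           # [4, 4, 4, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1]
```

Grouping is equivalent to extracting the per-value cliques from the graph, because all nodes that share one attribute value are pairwise connected for that value.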

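Further, fault location needs to be performed according to the complete subgraphs, so step 103 includes:

verifying the reliability of the complete subgraphs and acquiring the complete subgraphs meeting the reliability requirement;

and performing fault location according to the number of nodes of the complete subgraphs.

Specifically, verifying the reliability of the complete subgraphs and acquiring the complete subgraphs meeting the reliability requirement includes:

performing an F test on the kth complete subgraph, and determining the test value $F_k$ of the kth complete subgraph according to the following formula:

$$F_k = \frac{SSA_k / f_{SSA_k}}{SSE_k / f_{SSE_k}}$$

In the above formula, $SSA_k$ is the between-group sum of squares of the kth complete subgraph, $f_{SSA_k}$ is the degree of freedom of $SSA_k$, $SSE_k$ is the within-group sum of squares of the kth complete subgraph, and $f_{SSE_k}$ is the degree of freedom of $SSE_k$;

wherein $f_{SSE_k} = \text{total number of test data} - f_{SSA_k} - 1$ and $f_{SSA_k} = 1$;

if the test value $F_k$ of the kth complete subgraph is greater than the F-test threshold, the kth complete subgraph meets the reliability requirement.

The between-group sum of squares $SSA_k$ of the kth complete subgraph is determined as follows:

$$SSA_k = n_k\,(\bar{x}_{kM} - \bar{x}_M)^2 + n'_k\,(\bar{x}'_{kM} - \bar{x}_M)^2$$

and the within-group sum of squares $SSE_k$ of the kth complete subgraph is determined as follows:

$$SSE_k = \sum_{i=1}^{n_k} (x_{kMi} - \bar{x}_{kM})^2 + \sum_{i=1}^{n'_k} (x'_{kMi} - \bar{x}'_{kM})^2$$

In the above formulas, $n_k$ is the number of test data whose attribute value is the same as that of the attribute corresponding to the kth complete subgraph, $\bar{x}_{kM}$ is the proportion of those test data whose test result is unavailable, $\bar{x}_M$ is the proportion of all test data whose test result is unavailable, $n'_k$ is the number of test data whose attribute value differs from that of the attribute corresponding to the kth complete subgraph, $\bar{x}'_{kM}$ is the proportion of those test data whose test result is unavailable, $x_{kMi}$ is the test result coefficient of the ith test data whose attribute value is the same as that of the attribute corresponding to the kth complete subgraph, and $x'_{kMi}$ is the test result coefficient of the ith test data whose attribute value differs from it. The test result coefficient of a piece of test data is 0 when its test result is available and 1 when its test result is unavailable.

For example, in the test data shown in Table 1, the number of records in the original data set $X$ whose second attribute value is Unicom is $n_k = 100$. Here $SSA_k = 2.4067$ and $SSE_k = 25.4933$, so $F_k = 14.9$. The F-test threshold is 20, so there is no significant difference, and this complete subgraph does not meet the reliability requirement.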
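The F test above can be sketched as a two-group one-way analysis of variance over 0/1 test results, with $SSA_k$ taken as the between-group and $SSE_k$ as the within-group sum of squares. The failure counts below are assumptions: the description gives only $n_k = 100$ and the resulting sums of squares, and 32 unavailable results among the 100 matching records plus 4 among 60 other records is one set of hypothetical counts that reproduces those figures. The function name `f_statistic` is likewise hypothetical.

```python
def f_statistic(same, other):
    """F value for two groups of 0/1 test results (1 = unavailable).

    `same` holds the results of records sharing the subgraph's attribute
    value, `other` the results of all remaining records.
    """
    n_same, n_other = len(same), len(other)
    total = n_same + n_other
    mean_same = sum(same) / n_same
    mean_other = sum(other) / n_other
    mean_all = (sum(same) + sum(other)) / total

    # Between-group sum of squares (SSA) and within-group sum of squares (SSE).
    ssa = n_same * (mean_same - mean_all) ** 2 + n_other * (mean_other - mean_all) ** 2
    sse = (sum((x - mean_same) ** 2 for x in same)
           + sum((x - mean_other) ** 2 for x in other))

    f_ssa = 1                    # two groups, so one degree of freedom
    f_sse = total - f_ssa - 1    # total number of test data - f_SSA - 1
    return ssa, sse, (ssa / f_ssa) / (sse / f_sse)


# Assumed counts, chosen only because they reproduce the quoted figures.
same = [1] * 32 + [0] * 68
other = [1] * 4 + [0] * 56
ssa, sse, f_value = f_statistic(same, other)
print(round(ssa, 4), round(sse, 4), round(f_value, 1))  # 2.4067 25.4933 14.9
```

With an F-test threshold of 20, the value 14.9 falls below the threshold, which matches the conclusion that the Unicom subgraph is not reliable.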

After the reliability of the complete subgraphs is judged, the attributes and attribute values most relevant to the unavailability of the tested service need to be found among the complete subgraphs formed by the nodes. Ideally, if the failures of several vertices are caused by the same event, the attributes of those vertices associated with the event have the same value, so all vertices with the same attribute value form a complete subgraph. Different faults form different complete subgraphs, but some complete subgraphs of different attributes may be caused by the same fault and need to be merged according to their correlation. Performing fault location according to the number of nodes of the complete subgraphs therefore comprises the following steps:

acquiring the complete subgraph $C_1$ with the largest number of nodes among the complete subgraphs meeting the reliability requirement, and adding the attribute value of the attribute corresponding to $C_1$ into a fault set;

For example, the complete subgraphs in Table 1 are sorted by number of nodes from large to small: the second attribute value Unicom (size 4, F test value 14.9), the fourth attribute value L1 (size 4, F test value 28.2), the fifth attribute value N11 (size 4, F test value 119.8), the third attribute value X (size 3, F test value 3.6), the sixth attribute value S11 (size 3, F test value 55.0), the seventh attribute value low (size 2, F test value 1.9), and the remaining eight attribute values (size 1) at the end. The attribute values satisfying the reliability requirement are the fourth attribute value L1, the fifth attribute value N11 and the sixth attribute value S11.

comparing the nodes of the xth complete subgraph $C_x$ among the complete subgraphs meeting the reliability requirement with the nodes of $C_1$, and, if the coincidence rate of the nodes is greater than the correlation threshold, adding the attribute value of the attribute corresponding to $C_x$ into the fault set, wherein $x \in [1, X]$ and $X$ is the total number of complete subgraphs meeting the reliability requirement;

and taking the attribute value in the fault set as a fault reason.

For example, when the first fault is detected in the example of Table 1, the initial fault subgraph $H_1$ is the complete subgraph formed by the second attribute value Unicom, and the fault set $D_1$ has only one attribute, $D_1$ = [system region = L1]. The attribute and attribute value of the complete subgraph ranked second by number of nodes are compared with $H_1$; the coincidence rate is greater than 70%, so $D_1$ includes that attribute and attribute value, i.e. $D_1$ = [system region = L1, network device = N11]. The attribute and attribute value ranked third are then compared with $H_1$; the coincidence rate is again greater than 70%, so $D_1$ includes that attribute and attribute value, i.e. $D_1$ = [system region = L1, network device = N11, server = S11]. No other complete subgraph satisfies the reliability requirement, so the final result is $D_1$ = [system region = L1, network device = N11, server = S11].
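The merging step in this example can be sketched as follows. The description does not define the coincidence rate, so this sketch assumes it is the share of the nodes of $C_1$ that also belong to $C_x$; the function name `locate_fault` and the node sets are hypothetical.

```python
def locate_fault(subgraphs, threshold=0.7):
    """Build the fault set from subgraphs that already passed the F test.

    `subgraphs` maps (attribute, value) to the set of failing node ids.
    """
    if not subgraphs:
        return []
    # Largest subgraph first; Python's sort keeps ties in insertion order.
    ranked = sorted(subgraphs.items(), key=lambda kv: len(kv[1]), reverse=True)
    first_key, first_nodes = ranked[0]
    fault_set = [first_key]
    for key, members in ranked[1:]:
        # Assumed definition: fraction of C_1's nodes that C_x also contains.
        coincidence = len(members & first_nodes) / len(first_nodes)
        if coincidence > threshold:
            fault_set.append(key)
    return fault_set


# Hypothetical node sets for the three reliable subgraphs of the example.
reliable = {
    ("system region", "L1"): {1, 3, 4, 5},
    ("network device", "N11"): {1, 3, 4, 5},
    ("server", "S11"): {1, 3, 5},
}
print(locate_fault(reliable))
```

With these sets the coincidence rates are 100% for N11 and 75% for S11, both above the 70% threshold, so all three attribute values end up in the fault set.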

Further, the embodiment provided by the present invention may also include removing the located fault cause and continuing fault location until all faults are located. Specifically, after the fault cause is located in step 103, the test records of that fault cause are removed, and steps 101 to 103 are repeated on the remaining test record set. For example, in Table 1, the records satisfying the $D_1$ condition are removed and the next fault subgraph is detected; if fault records still exist, steps 101 to 103 are repeated to obtain a further fault cause set, until no fault record remains and fault detection is complete.
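The repeat-until-clean loop can be sketched as below. Two things here are assumptions rather than the patent's method: `locate_once` is a deliberately simple stand-in for steps 102 and 103 (it just picks the most frequent attribute value among failing records, with no F test), and a record is taken to "satisfy the $D_1$ condition" when it matches every attribute value in the fault set. All names and sample records are hypothetical.

```python
from collections import Counter


def locate_once(failed):
    """Stand-in for steps 102-103: the most frequent (attribute, value) pair."""
    counts = Counter(
        (attr, value)
        for record in failed
        for attr, value in record.items()
        if attr != "result"
    )
    pair, _ = counts.most_common(1)[0]
    return [pair]


def matches(record, fault_set):
    """Assumed meaning of 'satisfies D': the record matches every pair in D."""
    return all(record.get(attr) == value for attr, value in fault_set)


def locate_all(records, locate=locate_once):
    """Repeat localisation, dropping explained records, until no failure is left."""
    remaining = list(records)
    causes = []
    while True:
        failed = [r for r in remaining if r["result"] == 1]
        if not failed:
            return causes
        fault_set = locate(failed)
        kept = [r for r in remaining if not matches(r, fault_set)]
        if len(kept) == len(remaining):
            return causes  # nothing was removed; stop instead of looping forever
        causes.append(fault_set)
        remaining = kept


# Hypothetical records: result 1 means the tested service was unavailable.
records = [
    {"region": "L1", "server": "S11", "result": 1},
    {"region": "L1", "server": "S12", "result": 1},
    {"region": "L1", "server": "S13", "result": 1},
    {"region": "L2", "server": "S21", "result": 1},
    {"region": "L2", "server": "S22", "result": 1},
    {"region": "L3", "server": "S31", "result": 0},
]
print(locate_all(records))  # [[('region', 'L1')], [('region', 'L2')]]
```

Passing a different `locate` callable, for example one that chains the subgraph construction, the F test and the merging step, would turn this skeleton into the full iteration; the guard on unchanged length only protects the sketch against a locator that explains no record.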

The invention also provides a distributed system root cause fault positioning device, as shown in fig. 2, the device includes:

an acquisition unit, configured to acquire test data of a distributed system and to select the test data in which the tested service is unavailable;

a construction unit, configured to construct a fault diagnosis graph from the test data in which the tested service is unavailable and to acquire the complete subgraphs in the fault diagnosis graph;

and a fault locating unit, configured to perform fault location according to the complete subgraphs.

The distributed system test data includes test condition attributes and their corresponding test result attribute. The test condition attributes include external attributes and their corresponding internal attributes. The external attributes include a test address, an operator and a tested service. The internal attributes corresponding to the external attributes include a network device state, an operating system state and an application program state, and are obtained from the external attributes by a data flow tracking method.

The acquisition unit includes:

and the selection module is used for selecting the test data if the attribute value of the test result attribute corresponding to the test condition attribute of the test data is unavailable.

The construction unit comprises:

the building module is used for taking test data in the test data, of which the tested service is unavailable, as nodes, and connecting attribute values corresponding to attributes of the test data, of which the tested service is unavailable, among the nodes to obtain a fault diagnosis graph if the attribute values corresponding to the attributes of the test data, of which the tested service is unavailable, among the test data among the nodes are the same;

and the extraction module is used for extracting a connection graph between single attribute values corresponding to the single attributes of the nodes in the fault diagnosis graph as a complete subgraph.

The fault locating unit includes:

the acquisition module is used for verifying the reliability of the complete subgraph and acquiring the complete subgraph meeting the reliability requirement;

and the fault positioning module is used for positioning the fault according to the number of the nodes of the complete subgraph.

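An F test is performed on the kth complete subgraph, and the test value $F_k$ of the kth complete subgraph is determined according to the following formula:

$$F_k = \frac{SSA_k / f_{SSA_k}}{SSE_k / f_{SSE_k}}$$

The between-group sum of squares $SSA_k$ of the kth complete subgraph is determined as follows:

$$SSA_k = n_k\,(\bar{x}_{kM} - \bar{x}_M)^2 + n'_k\,(\bar{x}'_{kM} - \bar{x}_M)^2$$

The within-group sum of squares $SSE_k$ of the kth complete subgraph is determined as follows:

$$SSE_k = \sum_{i=1}^{n_k} (x_{kMi} - \bar{x}_{kM})^2 + \sum_{i=1}^{n'_k} (x'_{kMi} - \bar{x}'_{kM})^2$$

In the above formulas, $n_k$ is the number of test data whose attribute value is the same as that of the attribute corresponding to the kth complete subgraph, $\bar{x}_{kM}$ is the proportion of those test data whose test result is unavailable, $\bar{x}_M$ is the proportion of all test data whose test result is unavailable, $n'_k$ is the number of test data whose attribute value differs from that of the attribute corresponding to the kth complete subgraph, $\bar{x}'_{kM}$ is the proportion of those test data whose test result is unavailable, $x_{kMi}$ is the test result coefficient of the ith test data whose attribute value is the same as that of the attribute corresponding to the kth complete subgraph, and $x'_{kMi}$ is the test result coefficient of the ith test data whose attribute value differs from it. The test result coefficient of a piece of test data is 0 when its test result is available and 1 when its test result is unavailable.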

The fault location module is further configured to:

acquiring the complete subgraph $C_1$ with the largest number of nodes among the complete subgraphs meeting the reliability requirement, and adding the attribute value of the attribute corresponding to $C_1$ into the fault set;

comparing the nodes of the xth complete subgraph $C_x$ among the complete subgraphs meeting the reliability requirement with the nodes of $C_1$, and, if the coincidence rate of the nodes is greater than the correlation threshold, adding the attribute value of the attribute corresponding to $C_x$ into the fault set, wherein $x \in [1, X]$ and $X$ is the total number of complete subgraphs meeting the reliability requirement;

and taking the attribute value in the fault set as a fault reason.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Finally, it should be noted that although the present invention has been described in detail with reference to the above embodiments, those skilled in the art should understand that modifications and equivalents may be made to the embodiments of the invention without departing from its spirit and scope, and that all such modifications and equivalents are covered by the claims.

Claims

1. A distributed system root cause fault positioning method is characterized by comprising the following steps:

carrying out fault positioning according to the complete subgraph;

the constructing of a fault diagnosis graph by using the test data for which the tested service is unavailable, and the acquiring of a complete subgraph in the fault diagnosis graph, comprise the following steps:

taking the test data for which the tested service is unavailable as nodes, and if the attribute values corresponding to an attribute are the same between nodes, connecting those nodes through that attribute value, to obtain a fault diagnosis graph;

extracting, as a complete subgraph, the connected graph formed among the nodes in the fault diagnosis graph by a single attribute value of a single attribute;

the fault location according to the complete subgraph comprises the following steps:

carrying out fault positioning according to the number of the nodes of the complete subgraph;

the fault location according to the number of nodes of the complete subgraph comprises:

comparing the nodes of the x-th complete subgraph C_x among the complete subgraphs satisfying the reliability requirement with the nodes of said C_1, and if the coincidence rate of the nodes is greater than the correlation threshold, adding the attribute value of the attribute corresponding to C_x to the fault set, wherein x ∈ [1, X] and X is the total number of complete subgraphs meeting the reliability requirement;

and taking the attribute value in the fault set as a fault reason.
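For illustration only, the graph construction and fault location steps of claim 1 can be sketched in Python as follows. The record layout, the function names, the choice of C_1 as the complete subgraph with the most nodes, the definition of the coincidence rate as the share of the nodes of C_x that also belong to C_1, and the threshold value 0.8 are all assumptions made for this sketch and are not fixed by the claim. The reliability check is omitted here, so every complete subgraph is treated as satisfying the reliability requirement.

```python
from collections import defaultdict


def complete_subgraphs(failed_records):
    """Group failed test records (nodes) by a shared (attribute, value) pair.

    Nodes that share one attribute value are pairwise connected, so every
    group with at least two nodes is a complete subgraph of the diagnosis graph.
    """
    groups = defaultdict(set)
    for node_id, record in enumerate(failed_records):
        for attribute, value in record.items():
            groups[(attribute, value)].add(node_id)
    return {key: nodes for key, nodes in groups.items() if len(nodes) >= 2}


def locate_faults(subgraphs, correlation_threshold=0.8):
    """Return the (attribute, value) pairs that are placed in the fault set."""
    if not subgraphs:
        return []
    # Assumption: C_1 is the complete subgraph with the largest node count.
    ranked = sorted(subgraphs.items(), key=lambda kv: len(kv[1]), reverse=True)
    c1_nodes = ranked[0][1]
    fault_set = []
    for key, nodes in ranked:
        # Assumption: coincidence rate = share of C_x nodes also found in C_1.
        coincidence = len(nodes & c1_nodes) / len(nodes)
        if coincidence > correlation_threshold:
            fault_set.append(key)
    return fault_set


# Hypothetical test data for which the tested service was unavailable.
failed = [
    {"operator": "A", "address": "10.0.0.1", "service": "web"},
    {"operator": "A", "address": "10.0.0.2", "service": "web"},
    {"operator": "A", "address": "10.0.0.3", "service": "dns"},
    {"operator": "B", "address": "10.0.0.4", "service": "dns"},
]
subgraphs = complete_subgraphs(failed)
fault_set = locate_faults(subgraphs)
print(fault_set)
```

In this hypothetical data, the attribute values operator A and service web end up in the fault set, while service dns does not, because only half of its nodes coincide with the nodes of the largest complete subgraph.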

2. The method of claim 1, wherein the test data of the distributed system comprises: test condition attributes and their corresponding test result attributes, wherein the test condition attributes comprise: external attributes and their corresponding internal attributes; the external attributes include: a test address, an operator and a tested service; the internal attributes corresponding to the external attributes include: a network device state, an operating system state and an application program state; and the internal attributes corresponding to the external attributes are obtained from the external attributes by a data flow tracking method.
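As a hedged illustration, one possible in-memory layout for the test data of claim 2 is sketched below in Python. The class names, field names and the string values "available" and "unavailable" are hypothetical. The helper `select_unavailable` is likewise only an assumed way of keeping the test data for which the tested service is unavailable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExternalAttributes:
    test_address: str
    operator: str
    tested_service: str


@dataclass(frozen=True)
class InternalAttributes:
    network_device_state: str
    operating_system_state: str
    application_state: str


@dataclass(frozen=True)
class DiagnosisRecord:
    external: ExternalAttributes   # test condition: external attributes
    internal: InternalAttributes   # test condition: internal attributes
    result: str                    # test result attribute (hypothetical values)


def select_unavailable(records):
    """Keep only records whose test result is 'unavailable' (assumed criterion)."""
    return [r for r in records if r.result == "unavailable"]


# Hypothetical records; all values are made up for this sketch.
records = [
    DiagnosisRecord(
        ExternalAttributes("10.0.0.1", "A", "web"),
        InternalAttributes("up", "ok", "crashed"),
        "unavailable",
    ),
    DiagnosisRecord(
        ExternalAttributes("10.0.0.2", "B", "web"),
        InternalAttributes("up", "ok", "running"),
        "available",
    ),
]
failed_records = select_unavailable(records)
```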

3. The method of claim 2, wherein said selecting the test data for which the service under test is unavailable comprises:

4. The method of claim 1, wherein verifying the reliability of the complete subgraphs to obtain the complete subgraphs that meet the reliability requirement comprises:

performing an F test on the k-th complete subgraph, and determining the test value F_k of the k-th complete subgraph according to the following formula:

F_k = (SSA_k / f_{SSA_k}) / (SSE_k / f_{SSE_k})

if the test value F_k of the k-th complete subgraph is greater than the F-test threshold, the k-th complete subgraph meets the reliability requirement.
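The ratio in the formula of claim 4 can be illustrated with the minimal Python sketch below. The function names, the handling of a zero denominator, and the numeric inputs, including the threshold 5.0, are hypothetical. The sums of squares and their degrees of freedom are taken as given inputs, because this sketch does not attempt to reproduce how the claims derive them.

```python
def f_statistic(ssa, df_ssa, sse, df_sse):
    """Compute F = (SSA / f_SSA) / (SSE / f_SSE) from precomputed inputs."""
    if df_ssa <= 0 or df_sse <= 0:
        raise ValueError("degrees of freedom must be positive")
    if sse == 0:
        # Assumed convention for this sketch: no residual variation at all.
        return float("inf") if ssa > 0 else 0.0
    return (ssa / df_ssa) / (sse / df_sse)


def meets_reliability(ssa, df_ssa, sse, df_sse, f_threshold):
    """A complete subgraph passes when its F value exceeds the threshold."""
    return f_statistic(ssa, df_ssa, sse, df_sse) > f_threshold


# Hypothetical inputs: SSA = 6.0 with 1 degree of freedom,
# SSE = 4.0 with 8 degrees of freedom, threshold 5.0.
f_k = f_statistic(6.0, 1, 4.0, 8)
reliable = meets_reliability(6.0, 1, 4.0, 8, f_threshold=5.0)
```

With these inputs F_k is 12.0, which exceeds the hypothetical threshold, so the subgraph would be kept.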

5. The method of claim 4, wherein the intra-group sum of squares SSA_k for the k-th complete subgraph is determined as follows:

the occupancy rate of the test data whose test result attribute value is unavailable,

the number of test data whose attribute value differs from that of the attribute corresponding to the k-th complete subgraph,

the occupancy rate of unavailable test result attribute values among the test data whose attribute value differs from that of the attribute corresponding to the k-th complete subgraph, and x_kMi is the attribute value coefficient of the test result of the test data having the same attribute value as that of the attribute corresponding to the k-th complete subgraph,

6. A distributed system root cause fault location apparatus for implementing the distributed system root cause fault location method according to any one of claims 1 to 5, the apparatus comprising:

the fault positioning unit is used for carrying out fault positioning according to the complete subgraph;

the construction unit comprises:

the building module is used for taking the test data for which the tested service is unavailable as nodes, and if the attribute values corresponding to an attribute are the same between nodes, connecting those nodes through that attribute value, to obtain a fault diagnosis graph;

the extraction module is used for extracting, as a complete subgraph, the connected graph formed among the nodes in the fault diagnosis graph by a single attribute value of a single attribute;

the fault locating unit comprises: