CN109474445B - Distributed system root fault positioning method and device - Google Patents


Info

Publication number
CN109474445B
Authority
CN
China
Prior art keywords
test data
attribute
complete subgraph
fault
test
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710801677.9A
Other languages
Chinese (zh)
Other versions
CN109474445A (en)
Inventor
赵丽
王泽�
郭三川
柳哲
何慧虹
徐太忠
潘欣雨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National Computer Network and Information Security Management Center
Original Assignee
National Computer Network and Information Security Management Center
Application filed by National Computer Network and Information Security Management Center
Priority to CN201710801677.9A
Publication of CN109474445A
Application granted
Publication of CN109474445B

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0677 Localisation of faults

Abstract

The invention relates to a method and a device for locating root cause faults of a distributed system. The method comprises: acquiring test data of a distributed system and selecting the test data in which the tested service is unavailable; constructing a fault diagnosis graph from the test data in which the tested service is unavailable, and acquiring the complete subgraphs in the fault diagnosis graph; and performing fault location according to the complete subgraphs. The technical scheme uses the test data to construct the fault diagnosis graph and diagnoses faults of the distributed system quickly and accurately according to that graph, which helps prevent the trouble and economic loss caused by system failure.

Description

Distributed system root fault positioning method and device
Technical Field
The invention relates to the technical field of data mining, in particular to a method and a device for locating root cause faults of a distributed system.
Background
With the rapid development of internet technologies such as cloud computing and big data, the distributed information systems of large enterprises keep growing in scale. For example, by 2014 Amazon had built 11 cloud zones and 28 data center groups with 2 million servers. By 2014, Google owned 100 million servers in its global data centers. Many business systems are closely tied to economic production and social life; once they break down or suffer a network attack, they cause great inconvenience and economic loss to society and may even lead to serious social security events. For example, on May 27, 2015, Alipay, which had nearly 300 million active users, suffered a large-scale outage; on May 28, 2015, the Ctrip website could not be accessed for 12 hours, with an economic loss of about 50 million yuan RMB. Therefore, for large-scale information systems, fast and efficient automatic fault detection and diagnosis is important in both research and practice.
Existing methods for analyzing root cause faults of distributed systems include graph-theory-based methods, expert-system-based methods, analytical-model-based methods and data-driven methods.
Fault diagnosis methods based on graph theory mainly include the Signed Directed Graph (SDG) method and the fault tree method. An SDG is a graphical model describing the causal relationships of a system: nodes represent events or variables, and directed edges represent the causal relationships among the variables. When a fault occurs, the cause of the fault is determined by applying a search strategy to the causal relationships among the node changes. A fault tree is a logical diagram that runs from effect to cause; starting from the fault state of the system, it reasons step by step to determine the basic cause, degree of influence and occurrence probability of the fault.
Fault diagnosis methods based on expert systems use the practical experience of domain experts to establish a knowledge base, and diagnose faults through a reasoning and decision process. Expert knowledge is often expressed as deterministic IF-THEN rules. Because expert knowledge is uncertain, fuzzy expert systems introduce the concept of fuzzy membership and reason with fuzzy logic, so that the uncertainty in the expert knowledge can be handled well.
Fault diagnosis methods based on analytical models describe the expected behavior of the system with an accurate mathematical model and diagnose faults by checking whether the actual inputs and outputs are consistent with it. They mainly include state estimation methods, parameter estimation methods and parity-space methods. State estimation methods mainly comprise filter methods and observer methods, and diagnose faults using techniques such as hypothesis testing. Parameter estimation methods assume that a fault changes the parameters of the model, and diagnose faults by detecting parameter changes in the model.
Data-driven fault diagnosis methods diagnose faults by analyzing and processing system operation data, without requiring an accurate analytical model of the system. They are mainly divided into machine learning methods, multivariate statistical analysis methods and the like. Machine learning methods train a neural network, a support vector machine or another machine learning algorithm on historical data of the system under normal conditions and under various fault conditions. Multivariate statistical analysis methods diagnose faults using the correlation among multiple variables, and mainly include the PCA (principal component analysis) method. The PCA method projects the sample matrix of the variables onto a low-dimensional space that reflects the main changes of the process variables, and diagnoses faults through given detection indexes.
Existing fault diagnosis methods for distributed information systems still have significant limitations, for two main reasons. On the one hand, existing methods for diagnosing availability test failures in distributed systems are not intuitive enough; in practice, as distributed information systems grow more complex and fault points become hard to monitor comprehensively, faults are difficult to diagnose accurately, and more intuitive analysis and diagnosis tools are needed. On the other hand, existing methods struggle with real-time unsupervised fault diagnosis. Graph-theory, expert-system and data-driven methods require knowledge of the system structure or training data, and analytical-model methods have difficulty modeling a complex system; such data or knowledge is hard to obtain in practice. Moreover, these methods do not use the information from system service availability tests, which is of important value for effective fault diagnosis.
Disclosure of Invention
The invention provides a method and a device for locating root cause faults of a distributed system. Its aim is to construct a fault diagnosis graph from the test data, diagnose faults of the distributed system quickly and accurately according to that graph, and thereby prevent the trouble and economic loss caused by system failure.
The purpose of the invention is realized by adopting the following technical scheme:
In a distributed system root cause fault positioning method, the improvement is that the method comprises:
acquiring test data of a distributed system, and selecting the test data in which the tested service is unavailable;
constructing a fault diagnosis graph from the test data in which the tested service is unavailable, and acquiring the complete subgraphs in the fault diagnosis graph;
and performing fault location according to the complete subgraphs.
Preferably, the distributed system test data comprises test condition attributes and their corresponding test result attribute. The test condition attributes comprise external attributes and their corresponding internal attributes. The external attributes comprise a test address, an operator and a tested service; the internal attributes comprise a network device state, an operating system state and an application program state, and are obtained from the external attributes by a data flow tracking method.
Further, selecting the test data in which the tested service is unavailable comprises:
selecting a piece of test data if the attribute value of the test result attribute corresponding to its test condition attributes is unavailable.
Preferably, constructing the fault diagnosis graph from the test data in which the tested service is unavailable and acquiring the complete subgraphs in the fault diagnosis graph comprises:
taking each piece of test data in which the tested service is unavailable as a node, and, if two nodes have the same attribute value for an attribute, connecting the two nodes through that attribute value, thereby obtaining the fault diagnosis graph;
and extracting, as a complete subgraph, the connected graph formed in the fault diagnosis graph by the nodes sharing a single attribute value of a single attribute.
Preferably, performing fault location according to the complete subgraphs comprises:
verifying the reliability of the complete subgraphs and acquiring the complete subgraphs that meet the reliability requirement;
and performing fault location according to the number of nodes of the complete subgraphs.
Further, verifying the reliability of the complete subgraphs and acquiring the complete subgraphs that meet the reliability requirement comprises:
performing an F test on the kth complete subgraph, and determining the test value $F_k$ of the kth complete subgraph according to the following formula:
$$F_k = \frac{SSA_k / f_{SSA_k}}{SSE_k / f_{SSE_k}}$$
In the above formula, $SSA_k$ is the within-group sum of squares of the kth complete subgraph, $f_{SSA_k}$ is the degree of freedom of $SSA_k$, $SSE_k$ is the between-group sum of squares of the kth complete subgraph, and $f_{SSE_k}$ is the degree of freedom of $SSE_k$;
if the test value $F_k$ of the kth complete subgraph is larger than the F test threshold, the kth complete subgraph meets the reliability requirement.
Further, the within-group sum of squares $SSA_k$ and the between-group sum of squares $SSE_k$ of the kth complete subgraph are determined from the following quantities:
$n_k$, the number of pieces of test data whose attribute value equals that of the attribute corresponding to the kth complete subgraph; the proportion of that test data whose test result is unavailable; the proportion of all the test data whose test result is unavailable; the number of pieces of test data whose attribute value differs from that of the attribute corresponding to the kth complete subgraph; the proportion of that differing test data whose test result is unavailable; $x_{kMi}$, the test result coefficient of test data whose attribute value equals that of the attribute corresponding to the kth complete subgraph; and the test result coefficient of test data whose attribute value differs. The test result coefficient of a piece of test data is 0 when its test result is available and 1 when its test result is unavailable.
Further, performing fault location according to the number of nodes of the complete subgraphs comprises:
acquiring the complete subgraph $C_1$ with the largest number of nodes among the complete subgraphs meeting the reliability requirement, and adding the attribute value of the attribute corresponding to $C_1$ to a fault set;
comparing the nodes of the xth complete subgraph $C_x$ among the complete subgraphs meeting the reliability requirement with the nodes of $C_1$, and, if their coincidence rate is greater than the correlation threshold, adding the attribute value of the attribute corresponding to $C_x$ to the fault set, where $x \in [1, X]$ and $X$ is the total number of complete subgraphs meeting the reliability requirement;
and taking the attribute values in the fault set as the fault cause.
In a distributed system root cause fault positioning device, the improvement is that the device comprises:
an acquisition unit, configured to acquire test data of a distributed system and select the test data in which the tested service is unavailable;
a construction unit, configured to construct a fault diagnosis graph from the test data in which the tested service is unavailable and acquire the complete subgraphs in the fault diagnosis graph;
and a fault location unit, configured to perform fault location according to the complete subgraphs.
The invention has the following beneficial effects:
the technical scheme provided by the invention constructs a fault diagnosis graph from the test data in which the tested service is unavailable, acquires the complete subgraphs in the fault diagnosis graph, and finally performs fault location according to the complete subgraphs. The test samples are described and analyzed with a multi-relation graph, which intuitively represents the associations among the failure records in the current system; compared with an ordinary weighted graph, the multi-relation graph better reflects the direct and exact associations among failed nodes in fault diagnosis. By locating faults through the associated clusters, the attributes and attribute values related to the failure are clustered at the same time as the failure records. In many cases the attribute values clearly reflect the root cause of the fault; in the remaining cases the clustering result roughly locates the root cause;
furthermore, during fault location the reliability of each complete subgraph is judged by an F test, which filters out large complete subgraphs that form merely because an attribute takes few distinct values, so that the complete subgraphs truly related to the fault cause are screened out.
Drawings
FIG. 1 is a flow chart of a distributed system root cause fault location method of the present invention;
FIG. 2 is a schematic structural diagram of the distributed system root cause fault positioning device of the present invention.
Detailed Description
The following detailed description of the embodiments of the invention refers to the accompanying drawings.
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention provides a distributed system root fault positioning method, as shown in fig. 1, including:
101. acquiring test data of a distributed system, and selecting test data which is unavailable to a tested service in the test data;
102. constructing a fault diagnosis graph by using test data of which the tested service is unavailable in the test data, and acquiring a complete subgraph in the fault diagnosis graph;
103. and carrying out fault positioning according to the complete subgraph.
The distributed system test data comprises test condition attributes and their corresponding test result attribute. The test condition attributes comprise external attributes and their corresponding internal attributes. The external attributes comprise a test address, an operator and a tested service; the internal attributes comprise a network device state, an operating system state and an application program state, and are obtained from the external attributes by a data flow tracking method.
The test result attribute is the result of a simulated use of the tested service of the distributed system: if the tested service is available, the attribute value of the test result attribute is available, and if the tested service is unavailable, the attribute value is unavailable.
Further, in step 101, the specific process of selecting the test data in which the tested service is unavailable comprises:
selecting a piece of test data if the attribute value of the test result attribute corresponding to its test condition attributes is unavailable.
For example: test data of a distributed system is acquired. The test data comprises test condition attributes and their corresponding test result attribute, and the test condition attributes comprise external test data and the internal monitoring data associated with it. The attributes of the external test data comprise a test address, an operator and a tested service; the internal monitoring data comprises a network device state, an operating system state and an application program state. Let $x_i = [x_{i1}, x_{i2}, \ldots, x_{iM}]$ denote a complete availability test record, where $x_{ik}$ is the kth attribute value of the ith test record and $M$ is the number of attributes. The data set of all test records is $X = \{x_i \mid i = 1, \ldots, |X|\}$, where $|X|$ is the number of records in $X$. An example test data set is shown in Table 1:
Table 1 Test data set example
The data in Table 1 represents 16 original test records, that is, $|X| = 16$. Each record has 7 test condition attributes (test location, operator, tested service, system region, network device, server and CPU occupancy) and 1 test result attribute, whose value is 0 when the tested service is available and 1 when the tested service is unavailable.
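The record layout and the selection in step 101 can be sketched as follows. This is a minimal illustration, assuming each test record is held as a Python dict; Table 1 is not reproduced in this text, so the field names and the three sample rows are hypothetical, and only the 0/1 coding of the test result follows the description.

```python
# Hypothetical layout: seven test condition attributes plus one result attribute.
# result == 1 means the tested service was unavailable, 0 means it was available.
CONDITION_ATTRIBUTES = [
    "test_location", "operator", "tested_service", "system_region",
    "network_device", "server", "cpu_occupancy",
]

# Three made-up rows for illustration only (not the rows of Table 1).
records = [
    {"test_location": "A", "operator": "Unicom", "tested_service": "X",
     "system_region": "L1", "network_device": "N11", "server": "S11",
     "cpu_occupancy": "low", "result": 1},
    {"test_location": "B", "operator": "Unicom", "tested_service": "X",
     "system_region": "L2", "network_device": "N21", "server": "S21",
     "cpu_occupancy": "low", "result": 0},
    {"test_location": "C", "operator": "Unicom", "tested_service": "X",
     "system_region": "L1", "network_device": "N11", "server": "S11",
     "cpu_occupancy": "high", "result": 1},
]


def select_unavailable(test_records):
    """Step 101: keep only the records whose test result is unavailable (1)."""
    return [record for record in test_records if record["result"] == 1]


failed = select_unavailable(records)
print(len(failed), "of", len(records), "records are failures")
```

The records returned by `select_unavailable` are the nodes from which the fault diagnosis graph is built in step 102.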
After the test data in which the tested service is unavailable has been selected, a fault diagnosis graph is constructed from it and the complete subgraphs in the fault diagnosis graph are acquired. Step 102 therefore comprises:
taking each piece of test data in which the tested service is unavailable as a node, and, if two nodes have the same attribute value for an attribute, connecting the two nodes through that attribute value, thereby obtaining the fault diagnosis graph;
and extracting, as a complete subgraph, the connected graph formed in the fault diagnosis graph by the nodes sharing a single attribute value of a single attribute.
For example, in Table 1 the test data in which the tested service is unavailable is the 1st, 3rd, 4th and 5th test records, which form 4 nodes. For the first attribute the nodes take 4 values, each forming a complete subgraph with 1 node. For the second attribute they take 1 value, forming a complete subgraph with 4 nodes. For the third attribute they take 2 values, forming one complete subgraph with 3 nodes and one with 1 node. For the fourth attribute they take 1 value, forming a complete subgraph with 4 nodes. For the fifth attribute they take 1 value, forming a complete subgraph with 4 nodes. For the sixth attribute they take 2 values, forming one complete subgraph with 3 nodes and one with 1 node. For the seventh attribute they take 3 values, forming one complete subgraph with 2 nodes and two complete subgraphs with 1 node each. In this example the set of failed records therefore yields 14 complete subgraphs in total.
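Because every pair of nodes sharing one value of one attribute is connected, each complete subgraph can be found by grouping the failed records by (attribute, value) pair. The sketch below assumes this grouping view. The four rows are hypothetical stand-ins for records 1, 3, 4 and 5, chosen only so that the number of distinct values per attribute matches the counts in the example; the test locations and the values `Y`, `S12`, `medium` and `high` are invented.

```python
from collections import defaultdict

ATTRIBUTES = [
    "test location", "operator", "tested service", "system region",
    "network device", "server", "CPU occupancy",
]

# Hypothetical failed records keyed by record number (Table 1 is not available).
failed = {
    1: ("A", "Unicom", "X", "L1", "N11", "S11", "low"),
    3: ("B", "Unicom", "X", "L1", "N11", "S11", "low"),
    4: ("C", "Unicom", "X", "L1", "N11", "S11", "medium"),
    5: ("D", "Unicom", "Y", "L1", "N11", "S12", "high"),
}


def complete_subgraphs(nodes):
    """Group node ids by (attribute, value).

    All nodes in one group share that value, so they are pairwise connected
    through it and form one complete subgraph of the fault diagnosis graph.
    """
    groups = defaultdict(set)
    for node_id, values in nodes.items():
        for attribute, value in zip(ATTRIBUTES, values):
            groups[(attribute, value)].add(node_id)
    return dict(groups)


cliques = complete_subgraphs(failed)
for (attribute, value), members in sorted(
        cliques.items(), key=lambda item: len(item[1]), reverse=True):
    print(f"{attribute} = {value}: nodes {sorted(members)}")
print(len(cliques), "complete subgraphs")
```

With these rows the grouping yields 14 complete subgraphs, in line with the count given in the example.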
Further, fault location needs to be performed according to the complete subgraph, so that the step 103 includes:
verifying the reliability of the complete subgraph and acquiring the complete subgraph meeting the reliability requirement;
and carrying out fault positioning according to the number of the nodes of the complete subgraph.
Specifically, verifying the reliability of the complete subgraphs and acquiring the complete subgraphs that meet the reliability requirement comprises:
performing an F test on the kth complete subgraph, and determining the test value $F_k$ of the kth complete subgraph according to the following formula:
$$F_k = \frac{SSA_k / f_{SSA_k}}{SSE_k / f_{SSE_k}}$$
In the above formula, $SSA_k$ is the within-group sum of squares of the kth complete subgraph, $f_{SSA_k}$ is the degree of freedom of $SSA_k$, $SSE_k$ is the between-group sum of squares of the kth complete subgraph, and $f_{SSE_k}$ is the degree of freedom of $SSE_k$;
where $f_{SSE_k}$ equals the total number of test data minus $f_{SSA_k}$ minus 1, and $f_{SSA_k} = 1$;
if the test value $F_k$ of the kth complete subgraph is larger than the F test threshold, the kth complete subgraph meets the reliability requirement.
The within-group sum of squares $SSA_k$ and the between-group sum of squares $SSE_k$ of the kth complete subgraph are determined from the following quantities:
$n_k$, the number of pieces of test data whose attribute value equals that of the attribute corresponding to the kth complete subgraph; the proportion of that test data whose test result is unavailable; the proportion of all the test data whose test result is unavailable; the number of pieces of test data whose attribute value differs from that of the attribute corresponding to the kth complete subgraph; the proportion of that differing test data whose test result is unavailable; $x_{kMi}$, the test result coefficient of test data whose attribute value equals that of the attribute corresponding to the kth complete subgraph; and the test result coefficient of test data whose attribute value differs. The test result coefficient of a piece of test data is 0 when its test result is available and 1 when its test result is unavailable.
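The formula bodies for $SSA_k$ and $SSE_k$ do not appear in this text. One reading that is consistent with the quantities listed above, with $f_{SSA_k} = 1$, and with an ordinary one-way analysis of variance on two groups is sketched below. The symbols $p_k$, $\bar{p}$, $n'_k$, $p'_k$, $x'_{kMj}$ and $N$ are placeholder names introduced here for quantities whose symbols are missing; they are assumptions, not the patent's notation.

```latex
% p_k    : proportion of unavailable results among the n_k records sharing the attribute value
% p'_k   : proportion of unavailable results among the n'_k records with a different value
% \bar p : proportion of unavailable results among all N = n_k + n'_k records
SSA_k = n_k \,(p_k - \bar{p})^2 + n'_k \,(p'_k - \bar{p})^2

SSE_k = \sum_{i=1}^{n_k} (x_{kMi} - p_k)^2 + \sum_{j=1}^{n'_k} (x'_{kMj} - p'_k)^2

F_k = \frac{SSA_k / f_{SSA_k}}{SSE_k / f_{SSE_k}}, \qquad
f_{SSA_k} = 1, \qquad f_{SSE_k} = N - f_{SSA_k} - 1 = N - 2
```

Under this reading $SSA_k$ measures the spread between the two groups and $SSE_k$ the spread inside them, which is the usual analysis-of-variance naming for a term with one degree of freedom in the numerator; the description labels the two terms the other way round.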
For example, in the test data shown in Table 1, the number of records in the original data set $X$ whose second attribute value is Unicom is $n_k = 100$, with $SSA = 2.4067$ and $SSE = 25.4933$, so $F = 14.9$. The F test threshold is 20, so there is no significant difference.
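The F test can be sketched in a few lines, assuming the standard two-group analysis of variance on 0/1 test result coefficients. The source gives only $n_k = 100$ and the resulting sums; the split used below (100 records with the attribute value, 32 of them unavailable, and 60 other records, 4 of them unavailable) is an assumption chosen because it reproduces the quoted figures.

```python
def f_statistic(same, different):
    """Return (SSA, SSE, F) for two groups of 0/1 coefficients (1 = unavailable).

    same      -- coefficients of records sharing the subgraph's attribute value
    different -- coefficients of all remaining records
    Both groups must be non-empty and SSE must be non-zero.
    """
    n_same, n_diff = len(same), len(different)
    total = n_same + n_diff
    p_same = sum(same) / n_same          # unavailable rate inside the group
    p_diff = sum(different) / n_diff     # unavailable rate outside the group
    p_all = (sum(same) + sum(different)) / total

    # Spread of the two group rates around the overall rate.
    ssa = n_same * (p_same - p_all) ** 2 + n_diff * (p_diff - p_all) ** 2
    # Spread of the individual coefficients around their own group rate.
    sse = (sum((x - p_same) ** 2 for x in same)
           + sum((x - p_diff) ** 2 for x in different))

    f_ssa = 1                            # degrees of freedom given in the text
    f_sse = total - f_ssa - 1
    return ssa, sse, (ssa / f_ssa) / (sse / f_sse)


# Assumed split (not stated in the source): 160 records in total.
same = [1] * 32 + [0] * 68
different = [1] * 4 + [0] * 56

ssa, sse, f_value = f_statistic(same, different)
print(round(ssa, 4), round(sse, 4), round(f_value, 1))  # 2.4067 25.4933 14.9
print("meets reliability requirement:", f_value > 20)   # False
```

With a threshold of 20 this subgraph is rejected, matching the conclusion of the example that the difference is not significant.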
After the reliability of the complete subgraphs has been judged, the attributes and attribute values most relevant to the unavailability of the tested service need to be found among the several complete subgraphs formed by the nodes. Ideally, if the failures of a set of vertices are caused by the same event, the attributes of those vertices that are associated with the event have the same value, so all vertices with the same attribute value form a complete subgraph. Different faults form different complete subgraphs, but complete subgraphs of different attributes may be caused by the same fault and need to be merged according to the correlation between the subgraphs. Performing fault location according to the number of nodes of the complete subgraphs therefore comprises:
acquiring the complete subgraph $C_1$ with the largest number of nodes among the complete subgraphs meeting the reliability requirement, and adding the attribute value of the attribute corresponding to $C_1$ to a fault set;
for example: the complete subgraphs in Table 1, sorted by number of nodes from large to small, are the second attribute value Unicom (size 4, F test value 14.9), the fourth attribute value L1 (size 4, F test value 28.2), the fifth attribute value N11 (size 4, F test value 119.8), the third attribute value X (size 3, F test value 3.6), the sixth attribute value S11 (size 3, F test value 55.0) and the seventh attribute value low (size 2, F test value 1.9), with the remaining eight attribute values (size 1) at the end. The attribute values meeting the reliability requirement are the fourth attribute value L1, the fifth attribute value N11 and the sixth attribute value S11.
comparing the nodes of the xth complete subgraph $C_x$ among the complete subgraphs meeting the reliability requirement with the nodes of $C_1$, and, if their coincidence rate is greater than the correlation threshold, adding the attribute value of the attribute corresponding to $C_x$ to the fault set, where $x \in [1, X]$ and $X$ is the total number of complete subgraphs meeting the reliability requirement;
and taking the attribute values in the fault set as the fault cause.
For example, when the first fault is detected in the example of Table 1, the initial fault subgraph $H_1$ is the complete subgraph formed by the second attribute with value Unicom, and the fault set $D_1$ initially contains a single entry, $D_1$ = [system region is L1]. The attribute and attribute value of the complete subgraph ranked second by number of nodes are compared with $H_1$; the coincidence is greater than 70%, so they are added to $D_1$, giving $D_1$ = [system region is L1, network device is N11]. The attribute and attribute value ranked third are then compared with $H_1$; the coincidence is again greater than 70%, so they are added to $D_1$, giving $D_1$ = [system region is L1, network device is N11, server is S11]. No other complete subgraph satisfies the reliability requirement, so the final result is $D_1$ = [system region is L1, network device is N11, server is S11].
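The merging step can be sketched as follows. The sizes and F test values are those quoted in the example, with an F threshold of 20 and a coincidence threshold of 70%. Two things are assumptions: the coincidence rate is taken to be the share of the nodes of $C_1$ that also appear in $C_x$, which the text does not define precisely, and the node memberships of the 3-node and 2-node subgraphs are hypothetical.

```python
def locate_fault(subgraphs, f_values, f_threshold=20.0, overlap_threshold=0.7):
    """Return the fault set as a list of (attribute, value) pairs.

    subgraphs -- {(attribute, value): set of node ids}
    f_values  -- {(attribute, value): F test value}
    """
    # Keep only the complete subgraphs that pass the F test.
    reliable = [key for key in subgraphs if f_values.get(key, 0.0) > f_threshold]
    if not reliable:
        return []
    # Largest subgraph first; the sort is stable, so ties keep their input order.
    reliable.sort(key=lambda key: len(subgraphs[key]), reverse=True)

    c1 = subgraphs[reliable[0]]
    fault_set = [reliable[0]]
    for key in reliable[1:]:
        # Assumed definition: fraction of C_1's nodes also present in C_x.
        overlap = len(subgraphs[key] & c1) / len(c1)
        if overlap > overlap_threshold:
            fault_set.append(key)
    return fault_set


subgraphs = {
    ("operator", "Unicom"): {1, 3, 4, 5},
    ("system region", "L1"): {1, 3, 4, 5},
    ("network device", "N11"): {1, 3, 4, 5},
    ("tested service", "X"): {1, 3, 4},      # membership is hypothetical
    ("server", "S11"): {1, 3, 4},            # membership is hypothetical
    ("CPU occupancy", "low"): {1, 3},        # membership is hypothetical
}
f_values = {
    ("operator", "Unicom"): 14.9,
    ("system region", "L1"): 28.2,
    ("network device", "N11"): 119.8,
    ("tested service", "X"): 3.6,
    ("server", "S11"): 55.0,
    ("CPU occupancy", "low"): 1.9,
}

fault_set = locate_fault(subgraphs, f_values)
print(fault_set)
```

Under these assumptions the result contains system region L1, network device N11 and server S11, the same three entries as the final $D_1$ in the example.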
Further, the embodiment provided by the invention may also remove the located fault cause and continue fault location until all faults have been located. Specifically, after the fault cause is located in step 103, the test records matching that fault cause are removed, and steps 101 to 103 are repeated on the remaining test record set. For example, in Table 1 the records satisfying the conditions in $D_1$ are removed and fault subgraph detection continues; if failure records remain, steps 101 to 103 are repeated to obtain a further fault cause set, until no failure records remain and fault detection is complete.
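The repeat-until-clean loop can be sketched independently of how a single fault is located. In the sketch below `locate_once` is a hypothetical callback standing in for steps 101 to 103, and `most_common_failed_server` is only a toy stand-in used to make the example runnable; neither name comes from the source.

```python
from collections import Counter


def matches(record, fault_set):
    """True if the record satisfies every (attribute, value) pair in the fault set."""
    return all(record.get(attribute) == value for attribute, value in fault_set)


def locate_all(records, locate_once):
    """Locate a fault, drop the records it explains, and repeat until none fail."""
    remaining = list(records)
    causes = []
    while any(record["result"] == 1 for record in remaining):
        fault_set = locate_once(remaining)
        if not fault_set:
            break  # nothing more can be located
        causes.append(fault_set)
        kept = [record for record in remaining if not matches(record, fault_set)]
        if len(kept) == len(remaining):
            break  # no record was removed; stop instead of looping forever
        remaining = kept
    return causes


def most_common_failed_server(records):
    """Toy stand-in for steps 101 to 103: blame the server with the most failures."""
    counts = Counter(r["server"] for r in records if r["result"] == 1)
    if not counts:
        return []
    server, _ = counts.most_common(1)[0]
    return [("server", server)]


# Made-up records for illustration only.
records = [
    {"server": "S11", "result": 1},
    {"server": "S11", "result": 1},
    {"server": "S12", "result": 1},
    {"server": "S13", "result": 0},
]
causes = locate_all(records, most_common_failed_server)
print(causes)
```

Each pass yields one fault cause set; the loop ends when the remaining records contain no failures.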
The invention also provides a distributed system root cause fault positioning device which, as shown in FIG. 2, comprises:
an acquisition unit, configured to acquire test data of a distributed system and select the test data in which the tested service is unavailable;
a construction unit, configured to construct a fault diagnosis graph from the test data in which the tested service is unavailable and acquire the complete subgraphs in the fault diagnosis graph;
and a fault location unit, configured to perform fault location according to the complete subgraphs.
The distributed system test data comprises test condition attributes and their corresponding test result attribute. The test condition attributes comprise external attributes and their corresponding internal attributes. The external attributes comprise a test address, an operator and a tested service; the internal attributes comprise a network device state, an operating system state and an application program state, and are obtained from the external attributes by a data flow tracking method.
The acquisition unit includes:
and the selection module is used for selecting the test data if the attribute value of the test result attribute corresponding to the test condition attribute of the test data is unavailable.
The construction unit comprises:
a building module, configured to take each piece of test data in which the tested service is unavailable as a node and, if two nodes have the same attribute value for an attribute, connect the two nodes through that attribute value, thereby obtaining the fault diagnosis graph;
and an extraction module, configured to extract, as a complete subgraph, the connected graph formed in the fault diagnosis graph by the nodes sharing a single attribute value of a single attribute.
The fault locating unit includes:
an acquisition module, configured to verify the reliability of the complete subgraphs and to obtain the complete subgraphs that meet the reliability requirement;
and a fault locating module, configured to locate faults according to the number of nodes of the complete subgraphs.
Wherein an F test is performed on the kth complete subgraph, and the test value $F_k$ of the kth complete subgraph is determined according to the following formula:
$$F_k = \frac{SSA_k / f_{SSA_k}}{SSE_k / f_{SSE_k}}$$
In the above formula, $SSA_k$ is the within-group sum of squares of the kth complete subgraph, $f_{SSA_k}$ is the degree of freedom of $SSA_k$, $SSE_k$ is the between-group sum of squares of the kth complete subgraph, and $f_{SSE_k}$ is the degree of freedom of $SSE_k$;
if the test value $F_k$ of the kth complete subgraph is greater than the F test threshold, the kth complete subgraph meets the reliability requirement.
The within-group sum of squares $SSA_k$ of the kth complete subgraph is determined by the following formula: [formula not reproduced in this text]
The between-group sum of squares $SSE_k$ of the kth complete subgraph is determined by the following formula: [formula not reproduced in this text]
In these formulas, $n_k$ is the number of test data whose attribute value is the same as the attribute value of the attribute corresponding to the kth complete subgraph, and $x_{kMi}$ is the attribute value coefficient of the test result of test data having that same attribute value. The remaining quantities, whose symbols are not reproduced in this text, are: the proportion of unavailable test results among the test data whose attribute value is the same as that of the kth complete subgraph; the proportion of unavailable test results among all the test data; the number of test data whose attribute value differs from that of the kth complete subgraph; the proportion of unavailable test results among those differing test data; and the attribute value coefficient of the test result of test data whose attribute value differs. The attribute value coefficient of a test result is set to 0 when the test result is available and to 1 when it is unavailable.
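Because the two sum-of-squares formulas are not reproduced in this text, the following Python sketch falls back on the textbook one-way analysis of variance for two groups, which is an assumption rather than the patent's exact definition. The first group holds the 0/1 coefficients of the test data whose attribute value matches the kth complete subgraph, the second group holds those of the remaining test data. The sketch puts the between-group mean square in the numerator, as in the textbook F statistic; the source text labels $SSA_k$ as within-group and $SSE_k$ as between-group, so the mapping to its symbols should be checked against the original formulas. The function names `f_statistic` and `meets_reliability` are hypothetical.

```python
def f_statistic(same, other):
    """Textbook two-group one-way ANOVA F value on 0/1 coefficients.

    same:  coefficients (1 = unavailable, 0 = available) of test data whose
           attribute value matches the complete subgraph.
    other: coefficients of test data whose attribute value differs.
    """
    n_m, n_n = len(same), len(other)
    if n_m == 0 or n_n == 0 or n_m + n_n <= 2:
        raise ValueError("need non-empty groups and more than two observations")
    mean_m = sum(same) / n_m            # unavailable proportion, matching data
    mean_n = sum(other) / n_n           # unavailable proportion, differing data
    grand = (sum(same) + sum(other)) / (n_m + n_n)  # overall proportion
    ss_between = n_m * (mean_m - grand) ** 2 + n_n * (mean_n - grand) ** 2
    ss_within = (sum((x - mean_m) ** 2 for x in same)
                 + sum((x - mean_n) ** 2 for x in other))
    df_between = 1                      # two groups minus one
    df_within = n_m + n_n - 2           # observations minus number of groups
    if ss_within == 0:
        # Perfect separation (or no variation at all): avoid dividing by zero.
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / df_between) / (ss_within / df_within)


def meets_reliability(same, other, threshold):
    """True when the F value exceeds the chosen F test threshold."""
    return f_statistic(same, other) > threshold
```

The threshold would normally come from an F distribution table for the chosen significance level and the two degrees of freedom; here it is simply passed in by the caller.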
The fault location module is further configured to:
obtain, among the complete subgraphs that meet the reliability requirement, the complete subgraph $C_1$ with the largest number of nodes, and add the attribute value of the attribute corresponding to $C_1$ to the fault set;
compare the nodes of the xth complete subgraph $C_x$ among the complete subgraphs that meet the reliability requirement with the nodes of $C_1$ and, if the coincidence rate of the nodes is greater than the correlation threshold, add the attribute value of the attribute corresponding to $C_x$ to the fault set, where $x \in [1, X]$ and $X$ is the total number of complete subgraphs that meet the reliability requirement;
and take the attribute values in the fault set as the fault causes.
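The fault locating steps above can be sketched as follows. The text does not define how the coincidence rate is computed, so this sketch assumes it is the share of the nodes of $C_x$ that also belong to $C_1$; that definition, the function name `locate_faults`, and the dictionary layout are all assumptions made for illustration.

```python
def locate_faults(subgraphs, correlation_threshold):
    """Return the set of (attribute, value) pairs reported as fault causes.

    subgraphs: dict mapping (attribute, value) -> set of node ids, containing
               only the complete subgraphs that passed the reliability check.
    """
    if not subgraphs:
        return set()
    # C_1: the complete subgraph with the largest number of nodes.
    largest_key = max(subgraphs, key=lambda key: len(subgraphs[key]))
    largest_nodes = subgraphs[largest_key]
    fault_set = {largest_key}
    for key, nodes in subgraphs.items():
        if key == largest_key or not nodes:
            continue
        # Assumed coincidence rate: fraction of C_x's nodes also in C_1.
        coincidence = len(nodes & largest_nodes) / len(nodes)
        if coincidence > correlation_threshold:
            fault_set.add(key)
    return fault_set
```

With this reading, an attribute value whose failed tests are almost all contained in the largest complete subgraph is reported together with it, while an attribute value that merely overlaps in a few nodes is left out.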
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that: although the present invention has been described in detail with reference to the above embodiments, it should be understood by those skilled in the art that: modifications and equivalents may be made to the embodiments of the invention without departing from the spirit and scope of the invention, which is to be covered by the claims.

Claims (6)

1. A distributed system root cause fault locating method, characterized by comprising the following steps:
acquiring test data of a distributed system, and selecting, from the test data, the test data for which the tested service is unavailable;
constructing a fault diagnosis graph from the test data for which the tested service is unavailable, and obtaining the complete subgraphs of the fault diagnosis graph;
locating faults according to the complete subgraphs;
wherein constructing the fault diagnosis graph from the test data for which the tested service is unavailable and obtaining the complete subgraphs of the fault diagnosis graph comprises:
taking the test data for which the tested service is unavailable as nodes and, if two nodes have the same attribute value for an attribute, connecting the two nodes by that attribute value, thereby obtaining the fault diagnosis graph;
extracting, as a complete subgraph, the connected graph formed in the fault diagnosis graph by a single attribute value of a single attribute of the nodes;
wherein locating faults according to the complete subgraphs comprises:
verifying the reliability of the complete subgraphs and obtaining the complete subgraphs that meet the reliability requirement;
locating faults according to the number of nodes of the complete subgraphs;
wherein locating faults according to the number of nodes of the complete subgraphs comprises:
obtaining, among the complete subgraphs that meet the reliability requirement, the complete subgraph $C_1$ with the largest number of nodes, and adding the attribute value of the attribute corresponding to $C_1$ to the fault set;
comparing the nodes of the xth complete subgraph $C_x$ among the complete subgraphs that meet the reliability requirement with the nodes of $C_1$ and, if the coincidence rate of the nodes is greater than the correlation threshold, adding the attribute value of the attribute corresponding to $C_x$ to the fault set, where $x \in [1, X]$ and $X$ is the total number of complete subgraphs that meet the reliability requirement;
and taking the attribute values in the fault set as the fault causes.
2. The method of claim 1, wherein the distributed system test data comprises test condition attributes and their corresponding test result attributes; the test condition attributes comprise external attributes and their corresponding internal attributes; the external attributes include the test address, the operator and the tested service; and the internal attributes corresponding to the external attributes include the network device state, the operating system state and the application program state, and are obtained from the external attributes by a data flow tracking method.
3. The method of claim 2, wherein selecting the test data for which the tested service is unavailable comprises:
selecting a piece of test data if the attribute value of the test result attribute corresponding to the test condition attributes of that test data is "unavailable".
4. The method of claim 1, wherein verifying the reliability of the complete subgraphs and obtaining the complete subgraphs that meet the reliability requirement comprises:
performing an F test on the kth complete subgraph, and determining the test value $F_k$ of the kth complete subgraph according to the following formula:
$$F_k = \frac{SSA_k / f_{SSA_k}}{SSE_k / f_{SSE_k}}$$
In the above formula, $SSA_k$ is the within-group sum of squares of the kth complete subgraph, $f_{SSA_k}$ is the degree of freedom of $SSA_k$, $SSE_k$ is the between-group sum of squares of the kth complete subgraph, and $f_{SSE_k}$ is the degree of freedom of $SSE_k$;
if the test value $F_k$ of the kth complete subgraph is greater than the F test threshold, the kth complete subgraph meets the reliability requirement.
5. The method of claim 4, wherein the within-group sum of squares $SSA_k$ of the kth complete subgraph is determined by the following formula: [formula not reproduced in this text]
The between-group sum of squares $SSE_k$ of the kth complete subgraph is determined by the following formula: [formula not reproduced in this text]
In these formulas, $n_k$ is the number of test data whose attribute value is the same as the attribute value of the attribute corresponding to the kth complete subgraph, and $x_{kMi}$ is the attribute value coefficient of the test result of test data having that same attribute value. The remaining quantities, whose symbols are not reproduced in this text, are: the proportion of unavailable test results among the test data whose attribute value is the same as that of the kth complete subgraph; the proportion of unavailable test results among all the test data; the number of test data whose attribute value differs from that of the kth complete subgraph; the proportion of unavailable test results among those differing test data; and the attribute value coefficient of the test result of test data whose attribute value differs. The attribute value coefficient of a test result is set to 0 when the test result is available and to 1 when it is unavailable.
6. A distributed system root cause fault location apparatus for implementing the distributed system root cause fault location method according to any one of claims 1 to 5, the apparatus comprising:
an acquisition unit, configured to acquire test data of a distributed system and to select, from the test data, the test data for which the tested service is unavailable;
a construction unit, configured to construct a fault diagnosis graph from the test data for which the tested service is unavailable, and to obtain the complete subgraphs of the fault diagnosis graph;
a fault locating unit, configured to locate faults according to the complete subgraphs;
wherein the construction unit comprises:
a building module, configured to take the test data for which the tested service is unavailable as nodes and, if two nodes have the same attribute value for an attribute, to connect the two nodes by that attribute value, thereby obtaining the fault diagnosis graph;
an extraction module, configured to extract, as a complete subgraph, the connected graph formed in the fault diagnosis graph by a single attribute value of a single attribute of the nodes;
and the fault locating unit comprises:
an acquisition module, configured to verify the reliability of the complete subgraphs and to obtain the complete subgraphs that meet the reliability requirement;
and a fault locating module, configured to locate faults according to the number of nodes of the complete subgraphs.
CN201710801677.9A 2017-09-07 2017-09-07 Distributed system root fault positioning method and device Active CN109474445B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710801677.9A CN109474445B (en) 2017-09-07 2017-09-07 Distributed system root fault positioning method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710801677.9A CN109474445B (en) 2017-09-07 2017-09-07 Distributed system root fault positioning method and device

Publications (2)

Publication Number Publication Date
CN109474445A CN109474445A (en) 2019-03-15
CN109474445B true CN109474445B (en) 2022-08-19

Family

ID=65658061

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710801677.9A Active CN109474445B (en) 2017-09-07 2017-09-07 Distributed system root fault positioning method and device

Country Status (1)

Country Link
CN (1) CN109474445B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101042714A (en) * 2007-04-29 2007-09-26 哈尔滨工业大学 Compressing method for SOC testing data suitable for suitable for multi-scanning chain designing core
CN101330417A (en) * 2008-07-24 2008-12-24 安徽大学 Quotient space overlay model for calculating network shortest path and building method thereof
CN103209094A (en) * 2013-03-11 2013-07-17 中国科学院信息工程研究所 System and method for fault positioning on basis of events
CN103914064A (en) * 2014-04-01 2014-07-09 浙江大学 Industrial process fault diagnosis method based on multiple classifiers and D-S evidence fusion

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
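Adaptive Diagnosis in Distributed Systems; Irina Rish; IEEE TRANSACTIONS ON NEURAL NETWORKS; 20050930; Vol. 6 (No. 5); main text, pages 2-4 *
Heuristic coloring algorithm for finding the maximum complete subgraph (求最大完全子图的启发式着色算法); Li Jianxin (李建新); Journal of Chuzhou University (《滁州学院学报》); 20100415 (No. 02); main text, page 2 *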

Also Published As

Publication number Publication date
CN109474445A (en) 2019-03-15

Similar Documents

Publication Publication Date Title
CN111124840B (en) Method and device for predicting alarm in business operation and maintenance and electronic equipment
CN103513983B (en) method and system for predictive alert threshold determination tool
CN105677791B (en) For analyzing the method and system of the operation data of wind power generating set
CN108923952B (en) Fault diagnosis method, equipment and storage medium based on service monitoring index
Nair et al. Learning a hierarchical monitoring system for detecting and diagnosing service issues
WO2009142832A2 (en) Ranking the importance of alerts for problem determination in large systems
CN111209131A (en) Method and system for determining fault of heterogeneous system based on machine learning
US8560279B2 (en) Method of determining the influence of a variable in a phenomenon
US20220078188A1 (en) Change Monitoring and Detection for a Cloud Computing Environment
CN109918313B (en) GBDT decision tree-based SaaS software performance fault diagnosis method
CN107168995B (en) Data processing method and server
CN111459700A (en) Method and apparatus for diagnosing device failure, diagnostic device, and storage medium
Lim et al. Identifying recurrent and unknown performance issues
US7716152B2 (en) Use of sequential nearest neighbor clustering for instance selection in machine condition monitoring
CN111290900A (en) Software fault detection method based on micro-service log
CN109474445B (en) Distributed system root fault positioning method and device
CN113391943A (en) Micro-service fault root cause positioning method and device based on cause and effect inference
CN111367781B (en) Instance processing method and device
CN110766100B (en) Bearing fault diagnosis model construction method, bearing fault diagnosis method and electronic equipment
CN111565118B (en) Virtualized network element fault analysis method and system based on multi-observation dimension HMM
CN111176872B (en) Monitoring data processing method, system, device and storage medium for IT operation and maintenance
CN112700324A (en) User loan default prediction method based on combination of Catboost and restricted Boltzmann machine
CN111160329A (en) Root cause analysis method and device
Gelman et al. Novel anomaly detection technique based on the nearest neighbour and sequential methods
Zhao et al. G-FDDS: A graph-based fault diagnosis in distributed systems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant