CN109474445A - Root fault localization method and device for a distributed system - Google Patents

Root fault localization method and device for a distributed system

Info

Publication number
CN109474445A
Authority
CN
China
Prior art keywords
attribute
test data
complete subgraph
test
attribute value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710801677.9A
Other languages
Chinese (zh)
Other versions
CN109474445B (en
Inventor
赵丽
王泽�
郭三川
柳哲
何慧虹
徐太忠
潘欣雨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National Computer Network and Information Security Management Center
Original Assignee
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National Computer Network and Information Security Management Center
Priority to CN201710801677.9A
Publication of CN109474445A
Application granted
Publication of CN109474445B
Status: Active
Anticipated expiration

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06: Management of faults, events, alarms or notifications
    • H04L41/0677: Localisation of faults

Abstract

The present invention relates to a root fault localization method and device for a distributed system. The method comprises: obtaining test data of the distributed system and selecting from it the test data in which the tested service is unavailable; constructing a fault diagnosis graph from the selected test data and obtaining the complete subgraphs of that graph; and performing fault localization according to the complete subgraphs. By constructing a fault diagnosis graph from test data, the technical solution diagnoses distributed system faults quickly and accurately, effectively preventing the disruption and economic loss caused by system failure.

Description

Root fault localization method and device for a distributed system
Technical field
The present invention relates to the field of data mining technology, and in particular to a root fault localization method and device for a distributed system.
Background
With the rapid development of Internet technologies such as cloud computing and big data, the distributed information systems of large enterprises keep growing in scale. For example, as of 2014 Amazon had built 11 cloud regions and 28 data center clusters with 2 million servers, and Google had 1 million servers in its data centers worldwide. These systems carry a large number of business systems closely tied to economic production and social life; once they fail or come under network attack, they cause great inconvenience and economic loss to production and daily life, and may even lead to serious social security incidents. For example, on May 27, 2015, Alipay, which had nearly 300 million active users, suffered a large-scale outage; on May 28, 2015, Ctrip.com was inaccessible for as long as 12 hours, at an economic loss of about 50 million yuan. Fast and effective automatic fault detection and diagnosis in large-scale information systems is therefore very important in both research and practice.
Current methods for analyzing root faults in distributed systems include graph-based methods, expert-system-based methods, analytical-model-based methods, and data-driven methods.
Graph-theory-based fault diagnosis methods mainly include the signed directed graph (SDG) method and the fault tree analysis method. An SDG is a graphical model that describes the causal relationships in a system: nodes represent events or variables, and directed edges represent the causal relationships between variables. When a fault occurs, its cause is determined from the causal relationships between the changed nodes, combined with a search strategy. A fault tree is a logic diagram that reasons from effect to cause: starting from a fault state of the system, it infers step by step the root cause, degree of influence, and probability of occurrence of the fault.
Expert-system-based fault diagnosis methods build a knowledge base from the practical experience of domain experts and carry out fault diagnosis through reasoning and decision processes. Expert knowledge is usually expressed as deterministic IF-THEN rules. Because expert knowledge is uncertain, fuzzy expert systems introduce the concept of fuzzy membership and reason with fuzzy logic, which handles the uncertainty in expert knowledge well.
Analytical-model-based fault diagnosis methods describe the expected behavior of the system with an accurate mathematical model and diagnose faults by checking whether the actual inputs and outputs are consistent with it. They mainly include state estimation methods, parameter estimation methods, and parity-space-based methods. State-estimation-based methods mainly include filter methods and observer methods, and diagnose faults using techniques such as hypothesis testing. Parameter-estimation-based methods assume that a fault changes the model parameters, and diagnose faults by detecting parameter changes in the model.
Data-driven fault diagnosis methods diagnose faults by analyzing and processing system operating data and do not require an accurate analytical model of the system. They fall mainly into machine learning methods and multivariate statistical analysis methods. Machine learning methods train learning algorithms such as neural networks or support vector machines on historical data of the system under normal and various fault conditions, and use them for fault diagnosis. Multivariate statistical analysis methods diagnose faults using the correlations between multiple variables; the main example is principal component analysis (PCA), which projects the sample matrix of the variables onto a lower-dimensional space that reflects the main variations of the process variables, and diagnoses faults with given detection indices.
Existing fault diagnosis methods for distributed information systems still have significant limitations, for two main reasons. First, fault diagnosis methods based on availability test failures are not intuitive enough: in practice, distributed information systems are increasingly complex and fault points are hard to detect comprehensively, so accurate diagnosis is difficult and more intuitive analysis and diagnosis tools are needed. Second, existing methods struggle with real-time, unsupervised fault diagnosis. Graph-theory, expert-system, and data-driven methods require knowledge of the system structure or training data, and analytical methods have difficulty modeling complex systems; such data and knowledge are hard to obtain in practice. Moreover, these methods do not use the information from availability tests of system services, although this information is valuable for effective fault diagnosis.
Summary of the invention
The present invention provides a root fault localization method and device for a distributed system. Its purpose is to construct a fault diagnosis graph from test data and to diagnose distributed system faults quickly and accurately according to that graph, effectively preventing the disruption and economic loss caused by system failure.
The purpose of the present invention is achieved by the following technical solution:
A root fault localization method for a distributed system, the improvement comprising:
obtaining test data of the distributed system, and selecting from the test data the test data in which the tested service is unavailable;
constructing a fault diagnosis graph from the test data in which the tested service is unavailable, and obtaining the complete subgraphs of the fault diagnosis graph;
performing fault localization according to the complete subgraphs.
Preferably, the distributed system test data comprises test condition attributes and their corresponding test result attribute. The test condition attributes comprise external attributes and their corresponding internal attributes. The external attributes comprise the test address, the operator, and the tested service; the corresponding internal attributes comprise the network device state, the business system state, and the application state, where the internal attributes corresponding to the external attributes are obtained from the external attributes by data flow tracing.
Further, selecting the test data in which the tested service is unavailable comprises:
if the value of the test result attribute corresponding to the test condition attributes of a piece of test data is "unavailable", selecting that test data.
Preferably, constructing the fault diagnosis graph from the test data in which the tested service is unavailable, and obtaining the complete subgraphs of the fault diagnosis graph, comprises:
taking each piece of test data in which the tested service is unavailable as a node and, if two nodes have the same value for an attribute, connecting the two nodes by that attribute value, thereby obtaining the fault diagnosis graph;
extracting, as a complete subgraph, the connected graph formed among the nodes by a single value of a single attribute.
Preferably, performing fault localization according to the complete subgraphs comprises:
verifying the reliability of the complete subgraphs to obtain the complete subgraphs that meet the reliability requirement;
performing fault localization according to the number of nodes of those complete subgraphs.
Further, verifying the reliability of the complete subgraphs to obtain the complete subgraphs that meet the reliability requirement comprises:
performing an F-test on the k-th complete subgraph, with the test statistic $F_k$ of the k-th complete subgraph determined by:
$$F_k = \frac{SSA_k / f_{SSA_k}}{SSE_k / f_{SSE_k}}$$
where $SSA_k$ is the between-group sum of squares of the k-th complete subgraph, $f_{SSA_k}$ is the degrees of freedom of $SSA_k$, $SSE_k$ is the within-group sum of squares of the k-th complete subgraph, and $f_{SSE_k}$ is the degrees of freedom of $SSE_k$;
if the test statistic $F_k$ of the k-th complete subgraph is greater than the F-test threshold, the k-th complete subgraph meets the reliability requirement.
Further, the between-group sum of squares $SSA_k$ of the k-th complete subgraph is determined by:
$$SSA_k = n_k(\bar{x}_{kM} - \bar{x})^2 + n_{k\bar{M}}(\bar{x}_{k\bar{M}} - \bar{x})^2$$
and the within-group sum of squares $SSE_k$ of the k-th complete subgraph is determined by:
$$SSE_k = \sum_{i=1}^{n_k}(x_{kMi} - \bar{x}_{kM})^2 + \sum_{i=1}^{n_{k\bar{M}}}(x_{k\bar{M}i} - \bar{x}_{k\bar{M}})^2$$
where $n_k$ is the number of test records whose value for the attribute corresponding to the k-th complete subgraph is the same as that of the subgraph; $\bar{x}_{kM}$ is the proportion of those records whose test result is "unavailable"; $\bar{x}$ is the proportion of all test records whose test result is "unavailable"; $n_{k\bar{M}}$ is the number of test records whose value for that attribute is different; $\bar{x}_{k\bar{M}}$ is the proportion of those records whose test result is "unavailable"; $x_{kMi}$ is the test result coefficient of the i-th record with the same attribute value; and $x_{k\bar{M}i}$ is the test result coefficient of the i-th record with a different attribute value. The test result coefficient of a record is 0 when its test result is "available" and 1 when its test result is "unavailable".
Further, performing fault localization according to the number of nodes of the complete subgraphs comprises:
among the complete subgraphs that meet the reliability requirement, taking the one with the most nodes, $C_1$, and adding the attribute value of the attribute corresponding to $C_1$ to the fault set;
comparing the overlap ratio between the nodes of the x-th complete subgraph $C_x$ that meets the reliability requirement and the nodes of $C_1$ and, if the overlap ratio is greater than the correlation threshold, adding the attribute value of the attribute corresponding to $C_x$ to the fault set, where x ∈ [1, X] and X is the total number of complete subgraphs that meet the reliability requirement;
taking the attribute values in the fault set as the fault cause.
A root fault localization device for a distributed system, the improvement being that the device comprises:
an acquisition unit, for obtaining test data of the distributed system and selecting from the test data the test data in which the tested service is unavailable;
a construction unit, for constructing a fault diagnosis graph from the test data in which the tested service is unavailable, and obtaining the complete subgraphs of the fault diagnosis graph;
a fault localization unit, for performing fault localization according to the complete subgraphs.
Beneficial effects of the present invention:
The technical solution provided by the present invention constructs a fault diagnosis graph from the test data in which the tested service is unavailable, obtains the complete subgraphs of the fault diagnosis graph, and finally performs fault localization according to the complete subgraphs. Describing and analyzing the test samples with a multi-relational graph shows the associations between the failure records in the current system more intuitively; compared with the more conventional weighted graph, a multi-relational graph better reflects the direct, exact associations between failed nodes in fault diagnosis. Performing fault localization on clusters of failure relations clusters the failure-related attributes and attribute values and the failure records at the same time. In many cases the attribute values already state the root fault cause explicitly; in the remaining cases the clustering result roughly locates the root fault.
Further, during fault localization the F-test is used to determine the reliability of the complete subgraphs. This filters out attribute values that easily form large complete subgraphs merely because the attribute takes few distinct values, and retains the complete subgraphs that are genuinely related to the fault cause.
Description of the drawings
Fig. 1 is a flow chart of the root fault localization method for a distributed system according to the present invention;
Fig. 2 is a structural schematic diagram of the root fault localization device for a distributed system according to the present invention.
Detailed description of the embodiments
Specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.
The root fault localization method for a distributed system provided by the present invention, as shown in Fig. 1, comprises:
101. obtaining test data of the distributed system, and selecting from the test data the test data in which the tested service is unavailable;
102. constructing a fault diagnosis graph from the test data in which the tested service is unavailable, and obtaining the complete subgraphs of the fault diagnosis graph;
103. performing fault localization according to the complete subgraphs.
The distributed system test data comprises test condition attributes and their corresponding test result attribute. The test condition attributes comprise external attributes and their corresponding internal attributes. The external attributes comprise the test address, the operator, and the tested service; the corresponding internal attributes comprise the network device state, the business system state, and the application state, where the internal attributes corresponding to the external attributes are obtained from the external attributes by data flow tracing.
The test result attribute is the result of simulating use of the tested service of the distributed system: if the tested service can be used, the value of the test result attribute is "available"; if the tested service cannot be used, the value is "unavailable".
Further, in step 101, selecting the test data in which the tested service is unavailable specifically comprises:
if the value of the test result attribute corresponding to the test condition attributes of a piece of test data is "unavailable", selecting that test data.
For example, distributed system test data comprising test condition attributes and the corresponding test result attribute is obtained, where the test condition attributes comprise external test data and internal monitoring data, and the two kinds of data are associated with each other. The attributes of the external test data comprise the test address, the operator, and the tested service; the internal monitoring data comprise the network device state, the business system state, and the application state. Let $x_i = [x_{i1}, x_{i2}, \ldots, x_{iM}]$ denote one complete availability test record, where $x_{ik}$ is the k-th attribute value of the i-th test record and M is the number of attributes. The data set of all test records is $X = \{x_i\}$, i = 1, ..., |X|, where |X| is the number of records in X. An example test data set is shown in Table 1:
Table 1: Example test data set
The data in Table 1 represent 16 original test records, i.e. |X| = 16. Each record has 7 test condition attributes (test site, operator, tested service, system region, network device, server, and CPU occupancy) and 1 test result attribute whose value is 1 or 0: 0 when the tested service is available and 1 when it is unavailable.
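The selection rule of step 101 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the field names (`test_site`, `operator`, `result`, and so on) and the sample values are hypothetical and are not the rows of Table 1; only the encoding of the result (1 = unavailable, 0 = available) follows the text.

```python
# Hypothetical availability-test records; field names and values are
# illustrative assumptions, not the actual rows of Table 1.
records = [
    {"test_site": "A", "operator": "Unicom", "service": "X", "region": "L1",
     "net_device": "N11", "server": "S11", "cpu": "low", "result": 1},
    {"test_site": "B", "operator": "Unicom", "service": "Y", "region": "L2",
     "net_device": "N21", "server": "S21", "cpu": "low", "result": 0},
    {"test_site": "C", "operator": "Unicom", "service": "X", "region": "L1",
     "net_device": "N11", "server": "S11", "cpu": "high", "result": 1},
]

UNAVAILABLE = 1  # the text encodes "unavailable" as 1 and "available" as 0


def select_unavailable(records, result_key="result"):
    """Return only the records whose test result is 'unavailable'."""
    return [r for r in records if r[result_key] == UNAVAILABLE]


failed = select_unavailable(records)
print(len(failed))  # 2
```

Only the selected records become nodes of the fault diagnosis graph; the records with an available result are still needed later as the comparison data for the reliability check.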
After the test data in which the tested service is unavailable has been obtained, it is used to construct the fault diagnosis graph, and the complete subgraphs of the fault diagnosis graph are obtained. Step 102 therefore comprises:
taking each piece of test data in which the tested service is unavailable as a node and, if two nodes have the same value for an attribute, connecting the two nodes by that attribute value, thereby obtaining the fault diagnosis graph;
extracting, as a complete subgraph, the connected graph formed among the nodes by a single value of a single attribute.
For example, in the test data shown in Table 1, the test records in which the tested service is unavailable are taken as nodes; these are records 1, 3, 4, and 5, giving 4 nodes. For the first attribute (test site), the nodes take 4 values, and each value forms a complete subgraph with 1 node. For the second attribute, the nodes take 1 value, forming one complete subgraph with 4 nodes. For the third attribute, the nodes take 2 values, forming one complete subgraph with 3 nodes and one with 1 node. For the fourth attribute, the nodes take 1 value, forming one complete subgraph with 4 nodes. For the fifth attribute, the nodes take 1 value, forming one complete subgraph with 4 nodes. For the sixth attribute, the nodes take 2 values, forming one complete subgraph with 3 nodes and one with 1 node. For the seventh attribute, the nodes take 3 values, forming one complete subgraph with 2 nodes and two complete subgraphs with 1 node each. The set of persistently failing records in this example therefore yields 14 complete subgraphs in total.
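The grouping described above can be sketched as follows. Because any two failed records that share a value for an attribute are connected by that value, the set of records sharing one (attribute, value) pair is automatically a complete subgraph, so the sketch only needs to group node ids by that pair. The four records below are hypothetical stand-ins (Table 1 itself is not reproduced here); their values are chosen only so that the per-attribute counts match the example, and the attribute names are assumptions.

```python
from collections import defaultdict

ATTRIBUTES = ["test_site", "operator", "service", "region",
              "net_device", "server", "cpu"]

# Hypothetical failed records keyed by record number (1, 3, 4, 5 as in the
# example). Values are illustrative, chosen to reproduce the counts above.
failed = {
    1: {"test_site": "A", "operator": "Unicom", "service": "X", "region": "L1",
        "net_device": "N11", "server": "S11", "cpu": "low"},
    3: {"test_site": "B", "operator": "Unicom", "service": "X", "region": "L1",
        "net_device": "N11", "server": "S11", "cpu": "low"},
    4: {"test_site": "C", "operator": "Unicom", "service": "X", "region": "L1",
        "net_device": "N11", "server": "S11", "cpu": "mid"},
    5: {"test_site": "D", "operator": "Unicom", "service": "Y", "region": "L1",
        "net_device": "N11", "server": "S12", "cpu": "high"},
}


def complete_subgraphs(failed, attributes):
    """Map each (attribute, value) pair to the set of node ids sharing it."""
    groups = defaultdict(set)
    for node_id, record in failed.items():
        for attr in attributes:
            groups[(attr, record[attr])].add(node_id)
    return dict(groups)


subgraphs = complete_subgraphs(failed, ATTRIBUTES)
sizes = sorted((len(nodes) for nodes in subgraphs.values()), reverse=True)
print(len(subgraphs))  # 14
print(sizes)           # [4, 4, 4, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1]
```

Storing each subgraph as a node set, rather than materializing the edges, keeps the later size and overlap comparisons to simple set operations.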
Further, fault localization is then performed according to the complete subgraphs. Step 103 therefore comprises:
verifying the reliability of the complete subgraphs to obtain the complete subgraphs that meet the reliability requirement;
performing fault localization according to the number of nodes of those complete subgraphs.
Specifically, verifying the reliability of the complete subgraphs to obtain the complete subgraphs that meet the reliability requirement comprises:
performing an F-test on the k-th complete subgraph, with the test statistic $F_k$ of the k-th complete subgraph determined by:
$$F_k = \frac{SSA_k / f_{SSA_k}}{SSE_k / f_{SSE_k}}$$
where $SSA_k$ is the between-group sum of squares of the k-th complete subgraph, $f_{SSA_k}$ is the degrees of freedom of $SSA_k$, $SSE_k$ is the within-group sum of squares of the k-th complete subgraph, and $f_{SSE_k}$ is the degrees of freedom of $SSE_k$;
here $f_{SSE_k}$ = (total number of test records) - $f_{SSA_k}$ - 1, and $f_{SSA_k}$ = 1;
if the test statistic $F_k$ of the k-th complete subgraph is greater than the F-test threshold, the k-th complete subgraph meets the reliability requirement.
The between-group sum of squares $SSA_k$ of the k-th complete subgraph is determined by:
$$SSA_k = n_k(\bar{x}_{kM} - \bar{x})^2 + n_{k\bar{M}}(\bar{x}_{k\bar{M}} - \bar{x})^2$$
and the within-group sum of squares $SSE_k$ of the k-th complete subgraph is determined by:
$$SSE_k = \sum_{i=1}^{n_k}(x_{kMi} - \bar{x}_{kM})^2 + \sum_{i=1}^{n_{k\bar{M}}}(x_{k\bar{M}i} - \bar{x}_{k\bar{M}})^2$$
where $n_k$ is the number of test records whose value for the attribute corresponding to the k-th complete subgraph is the same as that of the subgraph; $\bar{x}_{kM}$ is the proportion of those records whose test result is "unavailable"; $\bar{x}$ is the proportion of all test records whose test result is "unavailable"; $n_{k\bar{M}}$ is the number of test records whose value for that attribute is different; $\bar{x}_{k\bar{M}}$ is the proportion of those records whose test result is "unavailable"; $x_{kMi}$ is the test result coefficient of the i-th record with the same attribute value; and $x_{k\bar{M}i}$ is the test result coefficient of the i-th record with a different attribute value. The test result coefficient of a record is 0 when its test result is "available" and 1 when its test result is "unavailable".
For example, in the test data shown in Table 1, the second attribute takes the value China Unicom in $n_k$ = 100 records of the original data set X; with SSA = 2.4067 and SSE = 25.4933, F = 14.9. The F-test threshold is 20, so the difference is not significant.
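The reliability check can be sketched as a two-group one-way analysis of variance over the 0/1 result coefficients, with the degrees of freedom given above ($f_{SSA_k}$ = 1 and $f_{SSE_k}$ = total records - 2). The function name and the group contents are assumptions: the text gives only $n_k$ = 100, SSA, SSE and F, so the failure counts (32 of 100 matching records, 4 of 60 other records) are hypothetical values chosen so that the sketch reproduces the quoted figures.

```python
def f_statistic(matching, other):
    """Two-group one-way ANOVA F statistic on 0/1 result coefficients.

    matching: coefficients of records that share the subgraph's attribute value
    other:    coefficients of records with a different value for that attribute
    """
    n_m, n_o = len(matching), len(other)
    total = n_m + n_o
    mean_m = sum(matching) / n_m
    mean_o = sum(other) / n_o
    mean_all = (sum(matching) + sum(other)) / total

    # Between-group and within-group sums of squares.
    ssa = n_m * (mean_m - mean_all) ** 2 + n_o * (mean_o - mean_all) ** 2
    sse = (sum((x - mean_m) ** 2 for x in matching)
           + sum((x - mean_o) ** 2 for x in other))

    f_ssa = 1
    f_sse = total - f_ssa - 1
    if sse == 0:
        return ssa, sse, float("inf")  # groups are perfectly separated
    return ssa, sse, (ssa / f_ssa) / (sse / f_sse)


# Hypothetical counts (an assumption, not taken from the source data).
matching = [1] * 32 + [0] * 68   # n_k = 100 records with the same value
other = [1] * 4 + [0] * 56       # 60 records with a different value

ssa, sse, f_value = f_statistic(matching, other)
print(f"{ssa:.4f} {sse:.4f} {f_value:.1f}")  # 2.4067 25.4933 14.9

F_THRESHOLD = 20  # threshold used in the example
print(f_value > F_THRESHOLD)  # False: the subgraph is not considered reliable
```

A large subgraph with a small F value, as here, is one whose attribute value is common among passing records too, so its size alone says little about the fault.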
After the reliability of the complete subgraphs has been judged, the attributes and attribute values most relevant to the unavailability of the tested service must be found, together with the complete subgraphs formed by the nodes. Ideally, if the node failures are caused by the same event, the attributes of the nodes related to that event take the same value, so all nodes with that attribute value form one complete subgraph. Different faults form different complete subgraphs, but complete subgraphs of different attributes may be caused by the same fault and must be merged according to the correlation between subgraphs. Performing fault localization according to the number of nodes of the complete subgraphs therefore comprises:
among the complete subgraphs that meet the reliability requirement, taking the one with the most nodes, $C_1$, and adding the attribute value of the attribute corresponding to $C_1$ to the fault set;
For example, the complete subgraphs in Table 1, in descending order of node count, are: second attribute = China Unicom (size 4, F value 14.9); fourth attribute = L1 (size 4, F value 28.2); fifth attribute = N11 (size 4, F value 119.8); third attribute = X (size 3, F value 3.6); sixth attribute = S11 (size 3, F value 55.0); seventh attribute = low (size 2, F value 1.9); the remaining eight attribute values come last (size 1). The attributes and attribute values that meet the reliability requirement are: fourth attribute = L1, fifth attribute = N11, and sixth attribute = S11.
comparing the overlap ratio between the nodes of the x-th complete subgraph $C_x$ that meets the reliability requirement and the nodes of $C_1$ and, if the overlap ratio is greater than the correlation threshold, adding the attribute value of the attribute corresponding to $C_x$ to the fault set, where x ∈ [1, X] and X is the total number of complete subgraphs that meet the reliability requirement;
taking the attribute values in the fault set as the fault cause.
For example, when the first fault is detected in the example of Table 1, the initial fault subgraph $H_1$ is the complete subgraph formed by the fourth attribute taking the value L1, and the fault set $D_1$ initially contains only one attribute: $D_1$ = [system region = L1]. The attribute and attribute value whose complete subgraph ranks second by node count is then compared with $H_1$; since its overlap exceeds 70%, it is added to $D_1$, i.e. $D_1$ = [system region = L1, network device = N11]. The attribute and attribute value ranked third is compared next; its overlap with $H_1$ also exceeds 70%, so it is added as well, i.e. $D_1$ = [system region = L1, network device = N11, server = S11]. No other complete subgraph meets the reliability requirement, so finally $D_1$ = [system region = L1, network device = N11, server = S11].
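The merging step can be sketched as below. Two points are assumptions rather than statements of the source: the overlap ratio is taken as the share of $C_1$'s nodes that also belong to $C_x$, and ties in size are broken by insertion order. The node sets are hypothetical and mirror the three reliable subgraphs of the example (sizes 4, 4 and 3).

```python
def locate_fault(reliable, overlap_threshold=0.7):
    """Merge reliable complete subgraphs into one fault set.

    reliable: dict mapping (attribute, value) to a set of node ids; only
              subgraphs that already passed the F-test should be passed in.
    """
    if not reliable:
        return []
    # Stable sort by size, largest first; equal sizes keep insertion order.
    ranked = sorted(reliable.items(), key=lambda kv: len(kv[1]), reverse=True)
    first_key, first_nodes = ranked[0]
    fault_set = [first_key]
    for key, nodes in ranked[1:]:
        # Assumed definition: fraction of C_1's nodes also present in C_x.
        overlap = len(nodes & first_nodes) / len(first_nodes)
        if overlap > overlap_threshold:
            fault_set.append(key)
    return fault_set


# Hypothetical node sets for the three subgraphs that passed the F-test.
reliable = {
    ("system_region", "L1"): {1, 3, 4, 5},
    ("network_device", "N11"): {1, 3, 4, 5},
    ("server", "S11"): {1, 3, 4},
}

d1 = locate_fault(reliable)
print(d1)
# [('system_region', 'L1'), ('network_device', 'N11'), ('server', 'S11')]
```

With these sets the server subgraph covers 3 of the 4 nodes of $C_1$ (75%), which is just above the 70% threshold used in the example.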
Further, the embodiment provided by the present invention may also exclude the fault causes already located and continue fault localization until all faults have been located. Specifically, after step 103 locates a fault cause, the test records belonging to that fault cause are removed and steps 101 to 103 are repeated on the remaining test records. For example, the records in Table 1 that satisfy the conditions of $D_1$ are removed and fault subgraphs are detected again; if failure records remain, steps 101 to 103 are repeated to obtain a further set of fault causes, until no failure records remain and fault detection is complete.
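The repeat-until-clean loop can be sketched as follows. To keep the sketch short, steps 101 to 103 are replaced by a toy stand-in, `dominant_condition`, which simply returns the most frequent (attribute, value) pair among the failed records; it is not the graph-and-F-test procedure described above. The function names, field names, and records are all hypothetical.

```python
from collections import Counter


def dominant_condition(records, attributes):
    """Toy stand-in for steps 101-103 (an assumption, not the patented
    procedure): the most frequent (attribute, value) among failed records."""
    failed = [r for r in records if r["result"] == 1]
    if not failed:
        return {}
    counts = Counter((a, r[a]) for r in failed for a in attributes)
    (attr, value), _ = counts.most_common(1)[0]
    return {attr: value}


def locate_all(records, attributes, locate=dominant_condition):
    """Locate a cause, drop the records it explains, repeat until no failures."""
    causes = []
    remaining = list(records)
    while any(r["result"] == 1 for r in remaining):
        cause = locate(remaining, attributes)
        kept = [r for r in remaining
                if not all(r[a] == v for a, v in cause.items())]
        if not cause or len(kept) == len(remaining):
            break  # nothing located or nothing removed: avoid looping forever
        causes.append(cause)
        remaining = kept
    return causes


records = [
    {"region": "L1", "server": "S11", "result": 1},
    {"region": "L1", "server": "S11", "result": 1},
    {"region": "L1", "server": "S12", "result": 1},
    {"region": "L2", "server": "S21", "result": 1},
    {"region": "L3", "server": "S21", "result": 1},
    {"region": "L2", "server": "S22", "result": 0},
]

causes = locate_all(records, ["region", "server"])
print(causes)  # [{'region': 'L1'}, {'server': 'S21'}]
```

Passing the localization routine in as the `locate` parameter lets the same loop drive a fuller implementation of steps 101 to 103 without change.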
The present invention also provides a root fault localization device for a distributed system. As shown in Fig. 2, the device comprises:
an acquisition unit, for obtaining test data of the distributed system and selecting from the test data the test data in which the tested service is unavailable;
a construction unit, for constructing a fault diagnosis graph from the test data in which the tested service is unavailable, and obtaining the complete subgraphs of the fault diagnosis graph;
a fault localization unit, for performing fault localization according to the complete subgraphs.
The distributed system test data comprises test condition attributes and their corresponding test result attribute. The test condition attributes comprise external attributes and their corresponding internal attributes. The external attributes comprise the test address, the operator, and the tested service; the corresponding internal attributes comprise the network device state, the business system state, and the application state, where the internal attributes corresponding to the external attributes are obtained from the external attributes by data flow tracing.
The acquisition unit comprises:
a selection module, for selecting a piece of test data if the value of the test result attribute corresponding to its test condition attributes is "unavailable".
The construction unit comprises:
a construction module, for taking each piece of test data in which the tested service is unavailable as a node and, if two nodes have the same value for an attribute, connecting the two nodes by that attribute value, thereby obtaining the fault diagnosis graph;
an extraction module, for extracting, as a complete subgraph, the connected graph formed among the nodes by a single value of a single attribute.
The fault localization unit comprises:
an obtaining module, for verifying the reliability of the complete subgraphs and obtaining the complete subgraphs that meet the reliability requirement;
a fault localization module, for performing fault localization according to the number of nodes of those complete subgraphs.
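The division into three units can be sketched as a small composition of callables. This is an illustrative structure only: the class name `FaultLocator`, the type aliases, and the sample data are assumptions, and the `localize` function here is deliberately simplified to report the largest subgraph, leaving out the reliability check and the merging step.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, List, Set, Tuple

Record = Dict[str, Hashable]
Key = Tuple[str, Hashable]        # (attribute, value)
Subgraphs = Dict[Key, Set[int]]   # nodes are indices of failed records


@dataclass
class FaultLocator:
    acquire: Callable[[], List[Record]]             # acquisition unit
    construct: Callable[[List[Record]], Subgraphs]  # construction unit
    localize: Callable[[Subgraphs], List[Key]]      # fault localization unit

    def run(self) -> List[Key]:
        failed = self.acquire()
        return self.localize(self.construct(failed))


# Hypothetical test data; field names are illustrative.
DATA = [
    {"region": "L1", "server": "S11", "result": 1},
    {"region": "L1", "server": "S12", "result": 1},
    {"region": "L2", "server": "S21", "result": 0},
]


def acquire() -> List[Record]:
    return [r for r in DATA if r["result"] == 1]


def construct(failed: List[Record]) -> Subgraphs:
    groups = defaultdict(set)
    for i, record in enumerate(failed):
        for attr, value in record.items():
            if attr != "result":
                groups[(attr, value)].add(i)
    return dict(groups)


def localize(subgraphs: Subgraphs) -> List[Key]:
    # Simplification: largest subgraph only; no F-test, no merging.
    if not subgraphs:
        return []
    return [max(subgraphs, key=lambda k: len(subgraphs[k]))]


locator = FaultLocator(acquire, construct, localize)
result = locator.run()
print(result)  # [('region', 'L1')]
```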
Here, an F-test is performed on the k-th complete subgraph, with the test statistic $F_k$ of the k-th complete subgraph determined by:
$$F_k = \frac{SSA_k / f_{SSA_k}}{SSE_k / f_{SSE_k}}$$
where $SSA_k$ is the between-group sum of squares of the k-th complete subgraph, $f_{SSA_k}$ is the degrees of freedom of $SSA_k$, $SSE_k$ is the within-group sum of squares of the k-th complete subgraph, and $f_{SSE_k}$ is the degrees of freedom of $SSE_k$;
if the test statistic $F_k$ of the k-th complete subgraph is greater than the F-test threshold, the k-th complete subgraph meets the reliability requirement.
The between-group sum of squares $SSA_k$ of the k-th complete subgraph is determined by:
$$SSA_k = n_k(\bar{x}_{kM} - \bar{x})^2 + n_{k\bar{M}}(\bar{x}_{k\bar{M}} - \bar{x})^2$$
and the within-group sum of squares $SSE_k$ of the k-th complete subgraph is determined by:
$$SSE_k = \sum_{i=1}^{n_k}(x_{kMi} - \bar{x}_{kM})^2 + \sum_{i=1}^{n_{k\bar{M}}}(x_{k\bar{M}i} - \bar{x}_{k\bar{M}})^2$$
where $n_k$ is the number of test records whose value for the attribute corresponding to the k-th complete subgraph is the same as that of the subgraph; $\bar{x}_{kM}$ is the proportion of those records whose test result is "unavailable"; $\bar{x}$ is the proportion of all test records whose test result is "unavailable"; $n_{k\bar{M}}$ is the number of test records whose value for that attribute is different; $\bar{x}_{k\bar{M}}$ is the proportion of those records whose test result is "unavailable"; $x_{kMi}$ is the test result coefficient of the i-th record with the same attribute value; and $x_{k\bar{M}i}$ is the test result coefficient of the i-th record with a different attribute value. The test result coefficient of a record is 0 when its test result is "available" and 1 when its test result is "unavailable".
The fault location module is further configured to:
Obtain, among the complete subgraphs that meet the reliability requirement, the complete subgraph C_1 with the largest number of nodes, and add the attribute value of the attribute corresponding to C_1 to a fault set;
Compare the overlap rate between the nodes of the x-th complete subgraph C_x among the complete subgraphs that meet the reliability requirement and the nodes of C_1, and, if the overlap rate is greater than a relevance threshold, add the attribute value of the attribute corresponding to C_x to the fault set, where x ∈ [1, X] and X is the total number of complete subgraphs that meet the reliability requirement;
Take the attribute values in the fault set as the fault causes.
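The fault location steps above can be sketched in Python as follows. This is an illustrative sketch, not the patent's implementation. The name `locate_faults`, the representation of a complete subgraph as a set of node identifiers keyed by an (attribute, value) pair, and in particular the definition of the overlap rate as the share of the nodes of C_x that also belong to C_1 are assumptions; the text does not define how the overlap rate is computed.

```python
from typing import Dict, FrozenSet, Hashable, Set, Tuple

AttributeValue = Tuple[str, Hashable]  # (attribute name, attribute value)


def locate_faults(
    subgraphs: Dict[AttributeValue, FrozenSet[int]],
    relevance_threshold: float,
) -> Set[AttributeValue]:
    """Return the fault set built from the reliable complete subgraphs.

    subgraphs maps each (attribute, value) pair to the node ids of its
    complete subgraph; only subgraphs that passed the reliability check
    should be supplied.
    """
    if not subgraphs:
        return set()

    # C_1: the complete subgraph with the most nodes (first one wins on ties).
    c1_key = max(subgraphs, key=lambda key: len(subgraphs[key]))
    c1_nodes = subgraphs[c1_key]
    fault_set: Set[AttributeValue] = {c1_key}

    for key, nodes in subgraphs.items():
        if key == c1_key or not nodes:
            continue
        # Assumed overlap rate: fraction of C_x's nodes that are also in C_1.
        overlap = len(nodes & c1_nodes) / len(nodes)
        if overlap > relevance_threshold:
            fault_set.add(key)
    return fault_set
```

With a different definition of the overlap rate (for instance, dividing by the size of C_1 instead), only the single `overlap` line would change.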
Those skilled in the art should understand that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, etc.) that contain computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention and not to limit them. Although the present invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that modifications or equivalent replacements may still be made to the specific embodiments of the present invention, and any modification or equivalent replacement that does not depart from the spirit and scope of the present invention shall fall within the scope of the claims of the present invention.

Claims (9)

1. A root fault locating method for a distributed system, characterized in that the method comprises:
Obtaining distributed system test data, and selecting from the test data the test data for which the tested service is unavailable;
Constructing a fault diagnosis graph from the test data for which the tested service is unavailable, and obtaining the complete subgraphs in the fault diagnosis graph;
Performing fault location according to the complete subgraphs.
2. The method according to claim 1, characterized in that the distributed system test data comprises test condition attributes and their corresponding test result attribute; the test condition attributes comprise external attributes and their corresponding internal attributes; the external attributes comprise the test address, the operator, and the tested service; and the internal attributes corresponding to the external attributes comprise the network device state, the operating system state, and the application state, wherein the internal attributes corresponding to the external attributes are obtained from the external attributes by data flow tracing.
3. The method according to claim 2, characterized in that selecting from the test data the test data for which the tested service is unavailable comprises:
If the attribute value of the test result attribute corresponding to the test condition attributes of a test data item is unavailable, selecting that test data item.
4. The method according to claim 1, characterized in that constructing a fault diagnosis graph from the test data for which the tested service is unavailable, and obtaining the complete subgraphs in the fault diagnosis graph, comprises:
Taking each test data item for which the tested service is unavailable as a node, and, if two nodes have the same attribute value for an attribute, connecting those nodes through that attribute value, thereby obtaining the fault diagnosis graph;
Extracting, as a complete subgraph, the connected graph formed among the nodes of the fault diagnosis graph by a single attribute value of a single attribute.
5. The method according to claim 1, characterized in that performing fault location according to the complete subgraphs comprises:
Verifying the reliability of the complete subgraphs to obtain the complete subgraphs that meet the reliability requirement;
Performing fault location according to the number of nodes of the complete subgraphs.
6. The method according to claim 5, characterized in that verifying the reliability of the complete subgraphs to obtain the complete subgraphs that meet the reliability requirement comprises:
Performing an F-test on the k-th complete subgraph, and determining the test value F_k of the k-th complete subgraph by the following formula:
$$F_k = \frac{SSA_k / f_{SSA_k}}{SSE_k / f_{SSE_k}}$$
In the above formula, SSA_k is the within-group sum of squares of the k-th complete subgraph, f_{SSA_k} is the degrees of freedom of SSA_k, SSE_k is the between-group sum of squares of the k-th complete subgraph, and f_{SSE_k} is the degrees of freedom of SSE_k;
If the test value F_k of the k-th complete subgraph is greater than the F-test threshold, the k-th complete subgraph meets the reliability requirement.
7. The method according to claim 6, characterized in that the within-group sum of squares SSA_k of the k-th complete subgraph is determined by the following formula:
The between-group sum of squares SSE_k of the k-th complete subgraph is determined by the following formula:
In these formulas, n_k is the number of test data items whose value for the attribute corresponding to the k-th complete subgraph is the same as the attribute value of that subgraph, and x_{kMi} is the attribute value coefficient of the test result of such a test data item. The formulas also use the following quantities: the proportion of those same-valued test data items whose test result is unavailable; the proportion of all test data whose test result is unavailable; the number of test data items whose value for the attribute corresponding to the k-th complete subgraph is different; the proportion of those different-valued test data items whose test result is unavailable; and the attribute value coefficient of the test result of each different-valued test data item. When the attribute value of the test result of a test data item is available, the corresponding attribute value coefficient is 0; when it is unavailable, the corresponding attribute value coefficient is 1.
8. The method according to claim 5, characterized in that performing fault location according to the number of nodes of the complete subgraphs comprises:
Obtaining, among the complete subgraphs that meet the reliability requirement, the complete subgraph C_1 with the largest number of nodes, and adding the attribute value of the attribute corresponding to C_1 to a fault set;
Comparing the overlap rate between the nodes of the x-th complete subgraph C_x among the complete subgraphs that meet the reliability requirement and the nodes of C_1, and, if the overlap rate is greater than a relevance threshold, adding the attribute value of the attribute corresponding to C_x to the fault set, where x ∈ [1, X] and X is the total number of complete subgraphs that meet the reliability requirement;
Taking the attribute values in the fault set as the fault causes.
9. A root fault locating device for a distributed system, characterized in that the device comprises:
An acquiring unit, configured to obtain distributed system test data and to select from the test data the test data for which the tested service is unavailable;
A construction unit, configured to construct a fault diagnosis graph from the test data for which the tested service is unavailable, and to obtain the complete subgraphs in the fault diagnosis graph;
A fault location unit, configured to perform fault location according to the complete subgraphs.
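The data selection and graph construction recited in claims 1, 3 and 4 can be illustrated with a short Python sketch. It is a hedged illustration only: the record layout (a dictionary per test data item with a `result` key), the literal value `"unavailable"`, the function names, and the choice to ignore groups with fewer than two nodes are assumptions made for the example and do not come from the claims. Because every pair of failed items that share one attribute value is connected, the set of nodes sharing that value forms a complete subgraph, so grouping by (attribute, value) is enough to enumerate them.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, FrozenSet, Hashable, List, Set, Tuple

Record = Dict[str, Hashable]            # one test data item (hypothetical layout)
AttributeValue = Tuple[str, Hashable]   # (attribute name, attribute value)


def select_unavailable(
    records: List[Record],
    result_key: str = "result",
    unavailable: Hashable = "unavailable",
) -> List[Tuple[int, Record]]:
    """Keep only test data whose test result attribute is unavailable.

    Each kept item is paired with its index, which serves as the node id.
    """
    return [(i, r) for i, r in enumerate(records) if r.get(result_key) == unavailable]


def complete_subgraphs(
    failed: List[Tuple[int, Record]],
    result_key: str = "result",
    min_nodes: int = 2,
) -> Dict[AttributeValue, FrozenSet[int]]:
    """Group failed nodes by each single value of each single attribute."""
    groups: Dict[AttributeValue, Set[int]] = defaultdict(set)
    for node_id, record in failed:
        for attribute, value in record.items():
            if attribute != result_key:
                groups[(attribute, value)].add(node_id)
    # Assumption: a group needs at least min_nodes nodes to count as a subgraph.
    return {k: frozenset(v) for k, v in groups.items() if len(v) >= min_nodes}


def edges_of(nodes: FrozenSet[int]) -> Set[Tuple[int, int]]:
    """All pairwise connections inside one complete subgraph."""
    return set(combinations(sorted(nodes), 2))
```

The dictionary returned by `complete_subgraphs` has one entry per attribute value shared by at least two failed test data items, which is the shape a later reliability check and fault location step would consume.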
Application CN201710801677.9A; priority date 2017-09-07; filing date 2017-09-07; title: Distributed system root fault positioning method and device; status: Active; granted as CN109474445B (en)

Publications (2)

Publication Number Publication Date
CN109474445A (en) 2019-03-15
CN109474445B (en) 2022-08-19

Family

ID=65658061

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant