CN108933694A - Data center network Fault Node Diagnosis method and system based on testing data - Google Patents

Data center network Fault Node Diagnosis method and system based on testing data

Info

Publication number
CN108933694A
Authority
CN
China
Prior art keywords
node
probability
network
data
malfunction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810603564.2A
Other languages
Chinese (zh)
Other versions
CN108933694B (en
Inventor
齐小刚
王冰纯
刘立芳
冯海林
胡绍林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority to CN201810603564.2A
Publication of CN108933694A
Application granted
Publication of CN108933694B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H04L41/0636 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis based on a decision tree analysis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0677 Localisation of faults

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention belongs to the technical field of supervision, monitoring, or testing arrangements and discloses a method and system for diagnosing faulty nodes in a data center network based on test data. A dynamic breadth-first spanning tree is generated from existing fault-detection information and used as the probing path between nodes; the fault probability of each network member is preliminarily determined by analyzing the test data under a given prior probability p; and a reasonable threshold, selected by analyzing the probability distribution function, is used to identify faulty nodes, classifying the suspect node set into a faulty node set and a normal node set. Compared with the HFD algorithm, the HBFD algorithm performs better in both the number of probes and diagnostic accuracy, and it accurately identifies faulty nodes in network topologies of different scales with fewer detection rounds. Extending HBFD with new methods to diagnose malicious nodes or other types of faults in the network is also of research value.

Description

Data center network Fault Node Diagnosis method and system based on testing data
Technical field
The invention belongs to the technical field of supervision, monitoring, or testing arrangements, and in particular relates to a method and system for diagnosing faulty nodes in a data center network based on test data.
Background
At present, the prior art commonly used in the industry is as follows. With the arrival of the big-data era, the growing demand for cloud computing keeps expanding the scale of data center networks. Today a data center network contains hundreds of thousands of servers connected through network interface cards (NICs), switches, routers, cables, and optical fibers; these servers are widely distributed and carry heavy traffic. In such large-scale systems, detecting and locating faults is essential for the network management system to restore network communication through its recovery mechanisms. Although much research has been devoted to fault diagnosis strategies, the following problems remain. 1) Diagnostic complexity: besides raising the time and space complexity of fault localization, the growth of the network also makes diagnosis more complex because of dynamic, incomplete, and uncertain information. Effectively reducing the number of probes and improving probing efficiency is therefore highly significant. 2) Increased network load: a data center can greatly shorten algorithm execution time, but may also increase controller overhead. One possible solution is to control the number of monitors with a participation strategy; another is to improve the validity of the probe data so as to reduce the data volume.
Existing network fault diagnosis techniques fall into three classes: passive fault detection, active fault detection, and fault pattern recognition based on network logs.
Passive fault detection deploys monitoring agents in the network to monitor its real-time performance and passively obtain the state of network members. One method places passive monitoring devices on particular links; these agents monitor the current state of network members by using all links in the network within a given time, but the method generates redundant monitoring agents in large-scale networks. Another passive method diagnoses faults using dependency graphs, but such algorithms can detect only a limited number of faulty nodes at a time and are not suitable for large-scale networks. Bayesian belief networks (BBNs) are also widely used in fault detection. A BBN models the network structure as a directed acyclic graph and then tries to find faulty nodes by analyzing observable end-to-end symptoms; however, fault inference has high time complexity in large-scale networks, so the fault management system cannot restore network communication in a timely and effective manner.
Active fault diagnosis usually uses probes to check the condition of servers; selected probes are sent to obtain statistics such as end-to-end loss rate, delay, and throughput. The controller then collects these statistics to obtain further information for active-probing inference, so designing a suitable and effective probing strategy is very important. One fault diagnosis system architecture uses adaptive probing. Most probe-based techniques comprise three components: probe station selection, probe selection, and fault inference. However, these methods are limited by traffic overhead in large-scale networks. One staged testing method reduces traffic overhead by using only a small set of probes in each stage to examine a small region of the network, but how to place probe stations reasonably and how to handle probe station failures need further study. One probe station selection algorithm minimizes the number of probe stations and makes them robust to failures; however, how to place probe stations so as to monitor failed stations remains unsolved.
With the development of big-data technology, fault diagnosis based on log data has attracted wide attention. Techniques based on network system logs are usually built on threshold algorithms: appropriate thresholds are first set for the various performance metrics of the network according to the experience of network administrators, and faults are then detected by comparing actual values with the preset thresholds. This technique is very simple but has two obvious drawbacks: 1) the thresholds are chosen empirically; 2) data below the threshold are not analyzed, so some details related to the network condition may be missed. A newer analysis system for active fault diagnosis considers not only keywords of abnormal logs such as "error" and "down" but also attempts to discover the patterns of sudden failures. However, data-driven algorithms have high time complexity in data preprocessing (for example data extraction, data cleaning, and exception handling).
In conclusion problem of the existing technology is:
(1) passive fault detection method can generate redundancy monitor agent in large scale network, cause to exist in network and be permitted Mostly useless detection packet, when network size is larger, heavy traffic when, the detection packet of redundancy will affect the regular traffic of network, very To influence network fault diagnosis as a result, passive fault diagnosis is not suitable for the environment of large scale network.Therefore in large scale network In, active malfunction monitoring technology on the one hand can be used, the redundancy detection packet in network is effectively reduced.On the other hand it then needs to improve The validity of packet is detected, the radix for carrying out the probe of fault diagnosis detection is reduced.
(2) active fault diagnosis is limited in large scale network by traffic overhead, needs to place in a network rationally Effective detection base station, the position and quantity that detect base station directly affect the accuracy of fault diagnosis result, but existing grind Study carefully the problem of there is no detection base station and quantity.And in large scale network, the probe path for designing overlay network has very greatly Time complexity, also need to recalculate when network topology changes, be not suitable for dynamic network structure.
(3) decision threshold rule of thumb selects in the technology based on network system log;On the one hand due to the skill Art does not analyze the data lower than threshold value, and will lead to some details related with Network status may be missed.Work as network When unpredictable mutation occurs for state, empirical value is unable to judge accurately the current state of network, leads to network failure management System can not obtain the fault message in network.And analyze in terms of all fault datas then there is time complexity is larger, it is superfluous The more problem of remaining information.
Solve the difficulty and meaning of above-mentioned technical problem:In large-scale data central site network, passive fault detection technique The Shortcomings in terms of real-time and validity, and active fault detection technique then has how to select detection base station and detection road The problem of diameter.For large-scale complex network, solve the problems, such as that the problem of sending traverses network path of probe is NP-hard, Each network change requires to recalculate, and all there is significant limitation for network topology reconstruct and optimization etc., therefore seek Look for a kind of new reasonable effective fault diagnosis model be in engineering very it is necessary to.On the other hand, judging network Node whether failure when, there is also significant limitations for the method for traditional dependence artificial experience, therefore establish suitable model Selecting suitable threshold value for different network structures is also very to have research significance.
Summary of the invention
In view of the problems of the prior art, the present invention provides a method and system for diagnosing faulty nodes in a data center network based on test data.
The invention is realized as follows. In the data center network faulty-node diagnosis method based on test data, a dynamic breadth-first spanning tree is generated from existing fault-detection information and used as the probing path between nodes; the fault probability of each network member is preliminarily determined by analyzing the test data under a given prior probability p; and a reasonable threshold, selected by analyzing the probability distribution function, is used to identify faulty nodes, classifying the suspect node set into a faulty node set and a normal node set.
Further, given a dynamic spanning tree, the test state $s_i$ between any pair of adjacent nodes consists of two indicator variables, $s = \{r_{ij}, r_{ji}\}$, where $r_{ij}$ ($r_{ji}$) is the result of node i (j) testing node j (i); $r_{ij} = 0$ means that node i identifies node j as normal, and $r_{ij} = 1$ means that node i identifies node j as faulty. The matrix formed by the $s_i$ is called the symptom matrix S.
Further, the method specifically comprises dynamic spanning tree search, fault probability evaluation, and fault reasoning:
Dynamic spanning tree search dynamically generates the probing tree according to the previous detection result;
Fault probability evaluation quantifies the fault probability of each suspect node;
Fault reasoning puts absolutely faulty nodes into the fault group, then selects a suitable threshold according to the fault probability table and divides the suspect nodes into a relatively faulty group and a relatively normal group.
Further, the dynamic spanning tree search is based on a heuristic breadth-first search, where N is the set of nodes in the network, NF is the set of normal nodes, and F is the set of faulty nodes:
Step 1: F ← the previous detection result;
Step 2: if F = ∅, go to Step 3; otherwise go to Step 4;
Step 3: find the breadth-first spanning tree with the classical algorithm;
Step 4: NF ← N − F; use NF as the initial search nodes.
Further, in the fault probability evaluation, the fault probability of each node is obtained from the fault probability table, and the decision function $\psi_c$ determines the final fault probability of each node in one detection round:
$$f(n_j) = \max\{\, e_{ij} \mid e_{ij} \in E \,\}, \quad n_j \in N$$
This yields a unique fault probability for each node; the final faulty nodes are then determined by fault reasoning.
Further, the fault reasoning regards a node as faulty when its fault probability is greater than 0.5.
Another object of the present invention is to provide a data center network faulty-node diagnosis system based on test data that implements the above method, the system comprising:
A spanning tree module, for generating a dynamic breadth-first spanning tree from existing fault-detection information as the probing path between nodes;
A fault probability determination module, for preliminarily determining the fault probability of each network member by analyzing the test data under a given prior probability p;
A classification module, for selecting a reasonable threshold by analyzing the probability distribution function to identify faulty nodes, classifying the suspect node set into a faulty node set and a normal node set.
Another object of the present invention is to provide a data center network system that applies the above faulty-node diagnosis method based on test data.
In summary, the advantages and positive effects of the present invention are as follows. Point-to-point probing avoids the problem of choosing the number and placement of probe stations; only a reliable data processing center is required in the network. For probe paths, exploiting the high node degree typical of data center networks, a breadth-first algorithm generates breadth-first search paths to obtain test data, reducing the computational overhead caused by topology reconstruction and optimization. An effective probability distribution function is designed through probability calculation, and a reasonable decision threshold is selected on that basis, which effectively avoids the influence of human experience. Experimental results show that, compared with the HFD algorithm, the HBFD algorithm performs better in both the number of probes and diagnostic accuracy, and it accurately identifies faulty nodes in network topologies of different scales with fewer detection rounds. It is a fault diagnosis technique better suited to large-scale data center networks.
Brief description of the drawings
Fig. 1 is a flowchart of the data center network faulty-node diagnosis method based on test data provided by an embodiment of the present invention.
Fig. 2 is an implementation flowchart of the data center network faulty-node diagnosis method based on test data provided by an embodiment of the present invention.
Fig. 3 is a schematic diagram of fault detection provided by an embodiment of the present invention.
Fig. 4 is a schematic diagram comparing the results of the HBFD and HFD algorithms provided by an embodiment of the present invention.
Fig. 5 is a schematic diagram of the influence of fault granularity provided by an embodiment of the present invention.
Fig. 6 is a schematic diagram of the influence of network size provided by an embodiment of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to embodiments. It should be understood that the specific embodiments described here only explain the invention and are not intended to limit it.
The rapid growth of data center networks in scale and structural complexity poses ever greater challenges to network management systems. Because network faults (node failures or link failures) are inevitable, finding methods that diagnose faults quickly so that network communication can be restored effectively has become an important research topic in academia and industry. Building on a heuristic breadth-first diagnosis algorithm, the present invention monitors node failures in data center networks quickly and effectively by 1) generating a dynamic search tree, 2) analyzing test data, and 3) selecting a reasonable threshold to determine faulty nodes. Simulation results show that the HBFD algorithm diagnoses node failures effectively and significantly reduces the number of probes and the false alarm rate while maintaining diagnostic accuracy.
As shown in Fig. 1, the data center network faulty-node diagnosis method based on test data provided by an embodiment of the present invention comprises the following steps:
S101: Generate a dynamic breadth-first spanning tree from existing fault-detection information as the probing path between nodes;
S102: Preliminarily determine the fault probability of each network member by analyzing the test data under a given prior probability p;
S103: Select a reasonable threshold by analyzing the probability distribution function to identify faulty nodes, classifying the suspect node set into a faulty node set and a normal node set.
The application principle of the present invention is further described below with reference to the accompanying drawings.
1. Algorithm overview
The invention models a real-world network as an undirected graph consisting of a set of nodes N connected by a set of links L; data center networks are characterized by high connectivity. Since current technology allows the units in a network to communicate with one another, the invention obtains real-time node status through mutual tests between adjacent nodes. The method first designs an algorithm for selecting probe paths quickly and effectively, in order to save diagnosis time and resource consumption. Therefore, given the high connectivity of data center networks, the invention first generates a dynamic breadth-first spanning tree from existing fault-detection information as the probing path between nodes, then preliminarily determines the fault probability of each network member by analyzing the test data under a given prior probability p, and finally selects a reasonable threshold by analyzing the probability distribution function to identify faulty nodes, classifying the suspect node set into a faulty node set and a normal node set.
HBFD mainly comprises three parts: dynamic spanning tree search (DSTS), fault probability evaluation (FPE), and fault reasoning (FR). The main idea of each is briefly described below.
Definition 1: Given a dynamic spanning tree, the test state $s_i$ (also called the symptom) between any pair of adjacent nodes consists of two indicator variables, $s = \{r_{ij}, r_{ji}\}$, where $r_{ij}$ ($r_{ji}$) is the result of node i (j) testing node j (i). Here $r_{ij} = 0$ means that node i identifies node j as normal, and $r_{ij} = 1$ means that node i identifies node j as faulty. The matrix formed by the $s_i$ is called the symptom matrix S.
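As an illustration of Definition 1, the following minimal Python sketch stores pairwise test results in a symptom matrix. The names `build_symptom_matrix` and `symptom`, the dictionary input format, and the use of `None` for untested pairs are assumptions made for this example; the patent does not prescribe a data layout.

```python
from typing import Dict, List, Optional, Tuple


def build_symptom_matrix(
    num_nodes: int,
    results: Dict[Tuple[int, int], int],
) -> List[List[Optional[int]]]:
    """Return S with S[i][j] = r_ij (0: i judges j normal, 1: i judges j faulty).

    Pairs that were never tested stay None (an assumption of this sketch).
    """
    matrix: List[List[Optional[int]]] = [
        [None] * num_nodes for _ in range(num_nodes)
    ]
    for (i, j), r in results.items():
        if r not in (0, 1):
            raise ValueError(f"test result for pair ({i}, {j}) must be 0 or 1")
        matrix[i][j] = r
    return matrix


def symptom(matrix: List[List[Optional[int]]], i: int, j: int):
    """Symptom of the adjacent pair (i, j): the tuple (r_ij, r_ji)."""
    return (matrix[i][j], matrix[j][i])
```

For a tree edge between nodes 0 and 1, `symptom(S, 0, 1)` then returns both directions of the mutual test in one tuple, which is the form used in the examples of Fig. 3.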
Dynamic spanning tree search (DSTS) dynamically generates the probing tree according to the previous detection result. HBFD can therefore avoid, as far as possible, using faulty nodes to probe other suspect nodes, which effectively reduces the uncertainty of the detection results (for example, when a faulty node i probes other nodes, it may label a faulty node as normal, so the resulting symptoms have a high false alarm rate).
Fault probability evaluation (FPE) quantifies the fault probability of each suspect node. As shown in Fig. 3, the fault probability of node n depends not only on the prior probability (the probability that a faulty node diagnoses correctly) but also on the diagnosis results of other nodes. FPE obtains the preliminary fault probability of each node from the fault probability table and then computes the node's final fault probability with the decision function. The higher the final fault probability, the more likely node n is faulty.
Finally, fault reasoning (FR) further analyzes the fault probabilities of all suspect nodes. First, FR puts absolutely faulty nodes (fault probability equal to 1) into the fault group. Then FR selects a suitable threshold according to the fault probability table and divides the suspect nodes into a relatively faulty group (F) and a relatively normal group (NF).
To account for the applicability of HBFD to data center networks of different structures, the invention considers randomly generated connected network topologies. It also assumes that the network contains no malicious nodes (nodes that produce erroneous information with probability 1) and that at least one trusted management controller is available to collect test data and execute the HBFD algorithm.
2. Heuristic fault diagnosis
The invention makes no assumption about any particular network topology, although specific optimizations may exist for different topologies. It also does not consider malicious nodes or link failures, and it regards deploying HBFD on different data center network topologies (such as VL2, FiConn, and DCell) as feasible. Although HBFD is logically a single entity, it can still be implemented in a distributed manner by analyzing test data stored in a distributed fashion. The invention also assumes that at least one trusted management controller (AC) in the network collects and analyzes the test data.
2.1 Problem analysis
A fault is a network event that is the root cause of a network communication problem. A node failure occurs when a device cannot be used to route or forward traffic. Node failures can have many causes, for example a device powering down for repair or crashing because of a hardware error, or packet loss and timed-out responses when network traffic is excessive; all of these can make the test results uncertain.
Active probing sends one or more packets to nodes in the network to detect the real-time state of each node. According to the PMC model, a pair of mutually testing nodes can produce six different outcomes: 1) the test result of two normal nodes is that both are in normal working condition (for example, $r_{ij} = 0$, $r_{ji} = 0$); 2) when a normal node tests a faulty node, the result must be that the tested node is faulty (for example, $r_{ij} = 0$, $r_{ji} = 1$); 3)-6) regardless of the state of the tested node, the test result of a faulty node is uncertain (for example, $r \in \{0, 1\}$). The invention defines the behavior of a faulty node as a symptom (for example, $r = 1$). Fig. 3 gives two simple examples that illustrate why faulty nodes make detection uncertain.
Example 1: In Fig. 3-a, {a, b, c} are three network nodes; assume a and c are faulty and b is normal. Then the result for {a, b} is either element of {(0,1), (1,1)}, and the result for {b, c} is either element of {(1,0), (1,1)}. When the symptom group is {(0,1), (1,0)}, node b is considered to be in good condition; when the symptom group is {(1,1), (1,1)}, node b is considered faulty; the remaining combinations are insufficient to determine the state of node b.
Example 2: In Fig. 3-b, when node a fails, the states of all four nodes are uncertain regardless of the states of nodes {b, c, d}.
Definition 2: A faulty node obtains the correct test result with a given probability; this probability is defined as the prior probability p.
Given the prior probability p, when a faulty node $n_i$ tests a normal node, it obtains the symptom $r_{ij} = 0$ with probability p and the symptom $r_{ij} = 1$ with probability 1 − p.
For a given network, the better the network state (for example connectivity, delay, and packet loss rate), the higher the prior probability p and the more useful the test data. When p = 1 the detection results are completely accurate, but this holds only in the ideal case.
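The test behavior described by Definition 2 can be simulated with a few lines of Python. This is a minimal sketch under stated assumptions: the function name `probe_result` and the seeded `random.Random` generator are hypothetical, and applying the same probability p when a faulty node tests another faulty node is an assumption drawn from the general wording of Definition 2 (the text spells out only the faulty-tests-normal case).

```python
import random


def probe_result(
    tester_faulty: bool,
    target_faulty: bool,
    p: float,
    rng: random.Random,
) -> int:
    """Return r_ij for tester i probing target j (0: normal, 1: faulty).

    A normal tester always reports the true state of the target.
    A faulty tester reports the true state with probability p and the
    opposite state with probability 1 - p.
    """
    truth = 1 if target_faulty else 0
    if not tester_faulty:
        return truth
    return truth if rng.random() < p else 1 - truth
```

With p = 1 a faulty tester behaves like a normal one, which matches the remark that detection is completely accurate only in the ideal case.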
2.2 Dynamic spanning tree search
To detect node failures, and given the high connectivity of data center networks, the invention uses the current detection information to generate a breadth-first spanning tree as the path along which nodes test one another. Even though the prior probability is introduced to quantify the uncertainty of detection results, the test results of faulty nodes remain uncertain, and the test data obtained from a single detection round are insufficient to find all faulty nodes. A point-to-point fault diagnosis algorithm known as HFD finds most faulty nodes through repeated detection, but it does not use the state information obtained after each round to reduce the uncertainty of subsequent results, so it needs many rounds and is not suitable for large-scale data center networks.
Data center networks are highly connected. For large-scale data center networks, searching for the spanning tree with breadth-first search (BFS) therefore effectively speeds up the search for probe paths. BFS also effectively avoids the high false alarm rate caused by a single diagnosis, as shown in Fig. 3. Based on BFS, the invention designs the heuristic breadth-first search (HBFS) shown in Algorithm 1, where N is the set of nodes in the network, NF is the set of normal nodes, and F is the set of faulty nodes.
HBFS dynamically uses the information about absolutely faulty nodes (fault probability 1) contained in the previous detection result to find a new tree that avoids using absolutely faulty nodes to probe other nodes. By reducing the uncertainty of the detection results and the number of communications, HBFS effectively improves the accuracy and speed of detection.
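One way to read the HBFS idea is sketched below in Python. It is an interpretation, not the patent's Algorithm 1: the function name `hbfs_probe_edges`, the adjacency-dictionary graph format, the choice to leave absolutely faulty nodes out of the tree entirely, and the restart of the search to cover disconnected remainders are all assumptions of this example.

```python
from collections import deque
from typing import Dict, Hashable, List, Sequence, Set, Tuple

Graph = Dict[Hashable, Sequence[Hashable]]


def hbfs_probe_edges(
    graph: Graph, faulty: Set[Hashable]
) -> List[Tuple[Hashable, Hashable]]:
    """Return the edges (parent, child) of a breadth-first spanning forest.

    Nodes in `faulty` (absolutely faulty nodes from the previous round)
    are skipped, so they never probe other nodes. With an empty `faulty`
    set this reduces to a classical BFS spanning tree.
    """
    candidates = [n for n in graph if n not in faulty]  # NF = N - F
    visited: Set[Hashable] = set()
    edges: List[Tuple[Hashable, Hashable]] = []
    for root in candidates:  # restart so every remaining component is covered
        if root in visited:
            continue
        visited.add(root)
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v in faulty or v in visited:
                    continue
                visited.add(v)
                edges.append((u, v))
                queue.append(v)
    return edges
```

Each returned edge is a pair of adjacent nodes that test each other in the next round; excluding known faulty nodes keeps their unreliable reports out of the new symptom matrix.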
2.3 Fault probability evaluation
For randomly generated network topologies, a test result of 0 means that the tested node is in the normal state, and a result of 1 means that the tested node is faulty. This gives the detection results and corresponding fault probabilities shown in Table 1.
Table 1: Fault probability table
The fault probability of each node can be obtained from the fault probability table. Along the breadth-first spanning tree search path, the detection results associated with each node are uncertain. The invention therefore designs, on the basis of the fault probabilities, a decision function (DF) $\psi_c$ to determine the final fault probability of each node in one detection round.
The DF yields a unique fault probability for each node. The final faulty nodes are then determined by fault reasoning.
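The decision function stated in the summary, $f(n_j) = \max\{e_{ij} \mid e_{ij} \in E\}$, takes the largest preliminary fault probability assigned to a node. A minimal Python sketch follows; the name `final_fault_probability` and the dictionary keyed by (tester, tested) pairs are assumptions of this example, and the probabilities in any usage are illustrative because the contents of Table 1 are not reproduced in this text.

```python
from typing import Dict, Hashable, Tuple


def final_fault_probability(
    prelim: Dict[Tuple[Hashable, Hashable], float],
) -> Dict[Hashable, float]:
    """Map each tested node j to max over i of e_ij.

    prelim[(i, j)] is the preliminary fault probability e_ij that the
    test by node i assigns to node j.
    """
    final: Dict[Hashable, float] = {}
    for (_, j), e_ij in prelim.items():
        # Keep the largest probability seen so far for node j.
        final[j] = max(final.get(j, 0.0), e_ij)
    return final
```

Taking the maximum is a conservative choice: a single test that strongly implicates a node is enough to keep that node under suspicion for the fault reasoning step.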
2.4 Fault inference
After the final fault probability of each node is obtained, the absolute faulty nodes (for example, ψc(·) = 1) are first placed in the faulty node set (F); two suitable thresholds are then selected according to the fault probability table to determine the other faulty nodes. The four detection results in Table 1 are analyzed as follows:
1) The result of Num. 1 indicates that the mutually tested nodes are in the same state. When the prior probability p ≥ 0.3, the fault probabilities of both nodes are below 0.5, so both nodes are considered to be in the normal state. When the prior probability p < 0.3, the fault probability is higher than 0.5. This may cause the decision function to assign a higher fault probability to normal probing nodes, and it also means that malicious nodes exist that provide false information to interfere with the fault diagnosis process.
2) The detection results of Num. 2 and Num. 3 show that an absolute faulty node (i.e., a node whose detected fault probability is 1) exists among the tested nodes. Regardless of the value of the prior probability p, the fault probability of the other node is very low. Therefore, the node with fault probability 1 is considered an absolute faulty node, and the other node is considered to be in the normal state.
3) The result of Num. 4 shows a probability that is always above 0.5: at least one of the mutually tested nodes is faulty, but which one cannot be determined. Both mutually tested nodes are therefore marked as suspect nodes and probed again in the next detection round for a further decision.
In summary, a node is considered a faulty node when its fault probability is greater than 0.5 (prior probability p < 0.3). The next detection round is analyzed together with the previous detection result, which effectively avoids using a faulty node to probe multiple other network members at the same time, thereby improving detection speed and reducing the uncertainty of the detection results.
The application effect of the present invention is explained in detail below with reference to simulations.
1. As an active point-to-point fault diagnosis algorithm based on data center network dial-test data, HBFD can be deployed flexibly and integrated with existing routing protocols, which improves the accuracy of fault diagnosis and reduces monitoring cost. The simulation results below present the performance of the algorithm proposed by the present invention.
The evaluation metrics are: (1) detection number (DN); (2) fault granularity (FG); (3) correct diagnosis rate (CDR); (4) false alarm rate (FDR). The detection number measures the number of communications between nodes needed to obtain dial-test data during the fault detection process; the lower the DN value, the better the algorithm. PF denotes the set of real faulty nodes. Fault granularity is the ratio of the number of faulty nodes to the network size.
The correct diagnosis rate and the false diagnosis rate are the two metrics that measure the accuracy of a fault diagnosis algorithm, and are written as:
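The formulas for these metrics are not reproduced in this text. Purely as an illustration, the sketch below computes them under commonly used set-based definitions, all of which are assumptions rather than the patent's own formulas: CDR is taken as the fraction of real faulty nodes (PF) that were diagnosed as faulty, FDR as the fraction of fault-free nodes wrongly diagnosed as faulty, and FG as |PF| divided by the network size. The function names are hypothetical.

```python
def fault_granularity(real_faulty, all_nodes):
    """Assumed definition: share of the network that is actually faulty."""
    return len(real_faulty) / len(all_nodes)


def correct_diagnosis_rate(diagnosed, real_faulty):
    """Assumed definition: real faulty nodes that were diagnosed, over all real faulty nodes."""
    if not real_faulty:
        return 1.0  # nothing to find, so nothing was missed
    return len(diagnosed & real_faulty) / len(real_faulty)


def false_alarm_rate(diagnosed, real_faulty, all_nodes):
    """Assumed definition: normal nodes diagnosed as faulty, over all normal nodes."""
    normal = all_nodes - real_faulty
    if not normal:
        return 0.0  # no normal node can be misdiagnosed
    return len(diagnosed - real_faulty) / len(normal)


if __name__ == "__main__":
    nodes = set(range(10))
    pf = {1, 2, 3, 4}          # real faulty nodes
    f = {1, 2, 3, 7}           # nodes reported faulty by a diagnosis run
    print(fault_granularity(pf, nodes))
    print(correct_diagnosis_rate(f, pf))
    print(false_alarm_rate(f, pf, nodes))
```

Other denominators are possible (for example, dividing false alarms by the number of diagnosed nodes), so any comparison with the reported figures should first confirm which definition was used.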
2. Simulation results
To examine the applicability of the HBFD algorithm, network topologies are generated and faulty nodes are selected at random. The simulation results are shown in Figs. 4 to 6, and the performance of the HBFD algorithm is analyzed from three aspects.
1) Advantages of HBFD
The advantages of the HBFD algorithm over the hierarchical fault diagnosis (HFD) algorithm are shown in Fig. 4. In a network topology of 5000 nodes, the prior probability is 0.3 and the detection granularity ranges over [0.05, 0.5]. The figure clearly shows that the HBFD algorithm improves both the detection number and the correct diagnosis rate. This also means that avoiding the use of faulty nodes to probe other nodes effectively improves diagnosis accuracy and reduces invalid communication. There is one disadvantage, however: HBFD may incur a false alarm rate, albeit a very low one, as shown in Fig. 4. The causes of the false alarm rate are analyzed below.
2) Influence of fault granularity
As shown in Fig. 5, FG is set in the range [0.1, 0.4]. The diagnosis accuracy remains very stable, but the false alarm rate rises slightly as the fault granularity increases. This is because a higher fault granularity means more faulty nodes in the network, so HBFS cannot find a dynamic spanning tree that is free of faulty nodes. As described for Table 1, the fourth symptom then occurs with higher probability, and several normal nodes are consequently diagnosed as faulty nodes.
3) Influence of network size
Networks of different sizes in the range [1000, 5000] with the same fault granularity of 0.5 are considered, as shown in Fig. 6. Clearly, the correct diagnosis rate is stable across networks of different sizes, and the false detection rate declines as the network size increases. These results show that HBFD is highly robust across networks of different sizes and has a low false diagnosis rate.
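The simulation setup described above (a generated topology with randomly selected faulty nodes at a given fault granularity) can be sketched as follows. The text does not specify how topologies are generated, so the construction here is an assumption: a random spanning tree guarantees connectivity and a chosen number of extra random links adds redundancy. The function names and the `extra_edges` parameter are hypothetical.

```python
import random


def random_connected_topology(n, extra_edges, rng):
    """Generate a connected undirected graph on nodes 0..n-1.

    A random tree is built first (each node attaches to an earlier node),
    then up to `extra_edges` additional distinct links are added.
    Returns a dict mapping node -> set of neighbours.
    """
    adjacency = {i: set() for i in range(n)}
    for v in range(1, n):
        u = rng.randrange(v)            # attach v to some earlier node
        adjacency[u].add(v)
        adjacency[v].add(u)
    # Never ask for more extra links than the graph can still hold.
    max_extra = n * (n - 1) // 2 - max(n - 1, 0)
    remaining = min(extra_edges, max_extra)
    while remaining > 0:
        u, v = rng.sample(range(n), 2)  # two distinct endpoints
        if v not in adjacency[u]:
            adjacency[u].add(v)
            adjacency[v].add(u)
            remaining -= 1
    return adjacency


def pick_faulty_nodes(n, fault_granularity, rng):
    """Randomly mark round(FG * n) nodes as faulty."""
    return set(rng.sample(range(n), round(fault_granularity * n)))


if __name__ == "__main__":
    rng = random.Random(0)              # fixed seed for repeatable experiments
    topology = random_connected_topology(1000, 2000, rng)
    faulty = pick_faulty_nodes(1000, 0.3, rng)
    print(len(topology), len(faulty))
```

Sweeping `n` over [1000, 5000] or `fault_granularity` over the ranges quoted above would mirror the structure of the reported experiments, though not necessarily their exact topologies.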
The point-to-point fault diagnosis algorithm of the present invention, based on dial-test data, detects and locates the most probable faulty nodes. Compared with the HFD algorithm, the HBFD algorithm performs better in terms of detection number and diagnosis accuracy, and it can accurately identify the faulty nodes in network topologies of different sizes with a lower detection number. In future work, on the one hand, the root causes of the algorithm's false alarm rate will be analyzed further in an attempt to eliminate it; on the other hand, more realistic network topologies will be considered and the HBFD algorithm will be combined with different transport protocols. Introducing new methods into HBFD to diagnose malicious nodes or other types of faults in the network is also of some value.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (8)

1. A data center network faulty node diagnosis method based on dial-test data, characterized in that the method generates a dynamic breadth-first spanning tree from existing fault detection information as the detection paths between nodes; preliminarily determines the fault probabilities of network members by analyzing the dial-test data based on a given prior probability p; and selects a reasonable threshold by analyzing the probability distribution function to identify faulty nodes, classifying the suspect node set into a faulty node set and a normal node set.
2. The data center network faulty node diagnosis method based on dial-test data according to claim 1, characterized in that, given a dynamic spanning tree, the test state si between any pair of adjacent nodes consists of two indicator variables s = {rij, rji}, where rij (rji) is the result of node i (j) testing node j (i); rij = 0 means that node j is identified as a normal node by node i, and rij = 1 means that node j is identified as faulty by node i; the matrix composed of the si is called the symptom matrix S.
3. The data center network faulty node diagnosis method based on dial-test data according to claim 1, characterized in that the method specifically comprises: dynamic spanning tree search, fault probability evaluation, and fault inference;
the dynamic spanning tree search dynamically generates a detection tree according to the previous detection result;
the fault probability evaluation quantifies the fault probability of each suspect node;
the fault inference places absolute faulty nodes in the fault group, selects suitable thresholds according to the fault probability table, and divides the suspect nodes into a relative fault group and a relatively normal group.
4. The data center network faulty node diagnosis method based on dial-test data according to claim 3, characterized in that the dynamic spanning tree search is based on heuristic breadth-first search, where N is the set of nodes in the network, NF is the set of normal nodes, and F is the set of faulty nodes;
Step 1, F ← the result of the previous detection round;
Step 2, if F = ∅, go to step 3; otherwise, go to step 4;
Step 3, find the breadth-first spanning tree using the classical algorithm;
Step 4, NF ← N − F; use NF as the initial search nodes.
5. The data center network faulty node diagnosis method based on dial-test data according to claim 3, characterized in that, in the fault probability evaluation, the fault probability of each node is obtained from the fault probability table, and the decision function ψc is used to determine the final fault probability of each node in a single detection round;
a unique fault probability is obtained for each node; and the final faulty nodes are determined by fault inference.
6. The data center network faulty node diagnosis method based on dial-test data according to claim 3, characterized in that the fault inference considers a node to be a faulty node when the fault probability of the node is greater than 0.5.
7. A data center network faulty node diagnosis system based on dial-test data that implements the data center network faulty node diagnosis method based on dial-test data according to claim 1, characterized in that the system comprises:
a breadth-first spanning tree module, configured to generate a dynamic breadth-first spanning tree from existing fault detection information as the detection paths between nodes;
a fault probability determining module, configured to preliminarily determine the fault probabilities of network members by analyzing the dial-test data based on a given prior probability p;
a classification module, configured to select a reasonable threshold by analyzing the probability distribution function to identify faulty nodes, classifying the suspect node set into a faulty node set and a normal node set.
8. A data center network system using the data center network faulty node diagnosis method based on dial-test data according to any one of claims 1 to 6.
CN201810603564.2A 2018-06-09 2018-06-09 Data center network fault node diagnosis method and system based on dial testing data Active CN108933694B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810603564.2A CN108933694B (en) 2018-06-09 2018-06-09 Data center network fault node diagnosis method and system based on dial testing data

Publications (2)

Publication Number Publication Date
CN108933694A true CN108933694A (en) 2018-12-04
CN108933694B CN108933694B (en) 2021-11-09

Family

ID=64446368

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810603564.2A Active CN108933694B (en) 2018-06-09 2018-06-09 Data center network fault node diagnosis method and system based on dial testing data

Country Status (1)

Country Link
CN (1) CN108933694B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101715203A (en) * 2009-11-30 2010-05-26 中国移动通信集团浙江有限公司 Method and device for automatically positioning fault points
CN103856789A (en) * 2014-03-13 2014-06-11 赛特斯信息科技股份有限公司 System and method for achieving OTT service quality guarantee based on user behavior analysis

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
QI XIAOGANG等: "Fault diagnosis based on dial-test data in datacenter networks", 《JOURNAL OF SYSTEMS ENGINEERING AND ELECTRONICS》 *
RONG ZHOU等: "Breadth-first heuristic search", 《ARTIFICIAL INTELLIGENCE》 *
党克等: "配电网计划孤岛划分方法研究 ", 《中国电力》 *
王伟等: "多值属性系统的故障诊断策略最优化方法 ", 《仪器仪表学报》 *
王冰纯: "基于数据析的网络诊断算法研究", 《CNKI优秀硕士学位论文全文数据库》 *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109802855A (en) * 2018-12-28 2019-05-24 华为技术有限公司 A kind of Fault Locating Method and device
CN110247826A (en) * 2019-07-10 2019-09-17 上海理工大学 A kind of point to point network continuity testing method
CN110247826B (en) * 2019-07-10 2022-03-25 上海理工大学 Point-to-point network connectivity test method
CN111327491A (en) * 2020-01-20 2020-06-23 上海市大数据中心 Server-centered pessimistic diagnosis method for data center network
CN113595810A (en) * 2021-06-17 2021-11-02 国网上海能源互联网研究院有限公司 Interactive testing method and system suitable for power distribution network information
CN113595810B (en) * 2021-06-17 2023-09-26 国网上海能源互联网研究院有限公司 Interactive testing method and system suitable for power distribution network information
CN115914009A (en) * 2021-08-10 2023-04-04 中国移动通信集团江苏有限公司 ToB private network service quality testing method and system
CN115914009B (en) * 2021-08-10 2024-10-22 中国移动通信集团江苏有限公司 ToB private network service quality testing method and system
CN113608072A (en) * 2021-10-06 2021-11-05 深圳市景星天成科技有限公司 Electric power self-healing rapid fault positioning method based on non-sound condition
CN113608072B (en) * 2021-10-06 2021-12-28 深圳市景星天成科技有限公司 Electric power self-healing rapid fault positioning method based on non-sound condition
CN114978794B (en) * 2022-05-19 2023-06-23 北京有竹居网络技术有限公司 Network access method, device, storage medium and electronic equipment
CN114978794A (en) * 2022-05-19 2022-08-30 北京有竹居网络技术有限公司 Network access method, device, storage medium and electronic equipment
CN117955850A (en) * 2023-07-31 2024-04-30 非凡士智能科技(苏州)有限公司 Method for detecting networking performance of Internet of things system and improving stability
CN117114102A (en) * 2023-10-13 2023-11-24 江苏前景瑞信科技发展有限公司 Transformer fault diagnosis method based on Bayesian network and fault tree
CN118473910A (en) * 2024-07-08 2024-08-09 鄂尔多斯市泛胜数据技术有限公司 Electric power Internet of things fault detection method and system based on edge cloud cooperation
CN118473910B (en) * 2024-07-08 2024-09-10 鄂尔多斯市泛胜数据技术有限公司 Electric power Internet of things fault detection method and system based on edge cloud cooperation

Also Published As

Publication number Publication date
CN108933694B (en) 2021-11-09

Similar Documents

Publication Publication Date Title
CN108933694A (en) Data center network Fault Node Diagnosis method and system based on testing data
CN104270268B (en) A kind of distributed system network performance evaluation and method for diagnosing faults
CN1925437B (en) System and method for detecting status changes in a network
CN104796273B (en) A kind of method and apparatus of network fault root diagnosis
CN115118581B (en) Internet of things data all-link monitoring and intelligent guaranteeing system based on 5G
EP2286337A2 (en) Ranking the importance of alerts for problem determination in large systems
CN109039763A (en) A kind of network failure nodal test method and Network Management System based on backtracking method
CN102684902B (en) Based on the network failure locating method of probe prediction
CN104935458B (en) A kind of performance bottleneck analysis method and device based on distributed automatization measurement
CN115237717A (en) Micro-service abnormity detection method and system
CN103023028A (en) Rapid grid failure positioning method based on dependency graph of entities
CN111884859B (en) Network fault diagnosis method and device and readable storage medium
CN112367191A (en) Service fault positioning method under 5G network slice
CN107147534A (en) A kind of probe deployment method of quantity optimization for power telecom network fault detect
CN112383934A (en) Multi-domain cooperation service fault diagnosis method under 5G network slice
CN111600805A (en) Bayes-based power data network congestion link inference algorithm
CN110557275B (en) Electric power communication network detection station selection algorithm based on network intrinsic characteristics
Nie et al. Passive diagnosis for WSNs using data traces
CN102281103A (en) Optical network multi-fault recovering method based on fuzzy set calculation
CN113890820A (en) Data center network fault node diagnosis method and system
Xu et al. Distributed fault diagnosis of wireless sensor networks
CN116011813A (en) Urban rail transit emergency monitoring method and device, electronic equipment and storage medium
CN115664928A (en) Interpretable graph-based root cause positioning method and device
CN117376084A (en) Fault detection method, electronic equipment and medium thereof
Patil et al. Probe station placement algorithm for probe set reduction in network fault localization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant