CN108933694A - Data center network fault node diagnosis method and system based on test data
- Publication number
- CN108933694A (application number CN201810603564.2A)
- Authority
- CN
- China
- Prior art keywords
- node
- probability
- network
- data
- malfunction
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0631—Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
- H04L41/0636—Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis based on a decision tree analysis
- H04L41/0677—Localisation of faults
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
Abstract
The invention belongs to the technical field of supervisory, monitoring, and testing arrangements and discloses a data center network fault node diagnosis method and system based on test data. A dynamic breadth-first spanning tree is generated from existing fault detection information and used as the detection path between nodes; the fault probability of each network member is preliminarily determined by analyzing the test data under a given prior probability p; and a reasonable threshold, selected by analyzing the probability distribution function, is used to identify faulty nodes, so that the suspect node set is classified into a faulty node set and a normal node set. Compared with the HFD algorithm, the HBFD algorithm performs better in terms of the number of detections and diagnostic accuracy, and it accurately identifies the faulty nodes with fewer detections in network topologies of different scales. Extending HBFD with new methods to diagnose malicious nodes or other types of faults in the network is also of research value.
Description
Technical field
The invention belongs to the technical field of supervisory, monitoring, and testing arrangements, and in particular relates to a data center network fault node diagnosis method and system based on test data.
Background art
At present, the prior art commonly used in the industry is as follows. With the arrival of the big data era, the continuously growing demand for cloud computing has made data center networks ever larger. Today a data center network comprises hundreds of thousands of servers connected by network interface cards (NICs), switches, routers, cables, and optical fibers; these servers are widely distributed and carry heavy traffic. In such large-scale systems, detecting and locating faults is essential for the network management system to restore network communication through its recovery mechanisms. Although much research has been devoted to fault diagnosis strategies, the following problems remain to be solved.
1) Diagnostic complexity: besides raising the time and space complexity of fault localization, growth in network size makes fault diagnosis more complicated because the available information is dynamic, incomplete, and uncertain. Effectively reducing the number of detections and improving detection efficiency is therefore highly worthwhile.
2) Increased network load: a data center can greatly shorten algorithm execution time, but it may also increase controller overhead. One possible solution is to control the number of monitors through a participation strategy; another is to improve the validity of the detection data so as to reduce the data volume.
Existing network fault diagnosis techniques fall into three classes: passive fault detection, active fault detection, and fault pattern recognition based on network logs.
Passive fault detection monitors the real-time performance of the network by deploying monitoring agents in it and passively obtaining the state information of network members. One method places passive monitors on particular links; these agents monitor the current state of network members by making use of all links in the network within a given time, but the method produces redundant monitoring agents in large-scale networks. Another passive method uses dependency graphs, but such algorithms can detect only a limited number of faulty nodes at a time and are unsuitable for large-scale networks. Bayesian belief networks (BBNs) are also widely used in fault detection. A BBN models the network structure as a directed acyclic graph and then tries to find the faulty nodes by analyzing end-to-end observable symptoms; however, fault reasoning has high time complexity in large-scale networks, so the fault management system cannot restore network communication promptly and effectively.
Active fault diagnosis usually uses probes to examine the state of servers. Selected probes are sent to obtain statistics such as end-to-end loss rate, delay, and throughput. A controller then collects these statistics to obtain further information for active probing inference, so designing a suitable and effective probing strategy is very important. One fault diagnosis system architecture uses adaptive probing. Most probe-based techniques comprise three components: probe station selection, probe selection, and fault reasoning. These methods, however, are limited by traffic overhead in large-scale networks. One staged testing method reduces the traffic overhead by using only a small set of probes in each stage to examine a small region of the network, but how to place probe stations reasonably and how to handle probe station failures need further study. One probe station selection algorithm minimizes the number of probe stations while making them robust to failures; however, how to place probe stations so as to monitor failed stations remains unsolved.
With the development of big data technology, fault diagnosis based on log data has attracted wide attention. Techniques based on network system logs are usually built on threshold algorithms: appropriate thresholds are first set for the different performance measures of the network according to the experience of network administrators, and faults are then detected by comparing actual values with the preset thresholds. This technique is very simple but has two obvious drawbacks: 1) its thresholds are chosen empirically; 2) data below the threshold are not analyzed, so some details related to the network condition may be missed. A new type of analysis system for active fault diagnosis considers not only the keywords of abnormal logs, such as errors and degradation, but also tries to discover the patterns of sudden failures. However, data-based algorithms have high time complexity in data preprocessing (such as data extraction, data cleaning, and exception handling).
In summary, the problems of the prior art are:
(1) Passive fault detection produces redundant monitoring agents in large-scale networks, so the network carries many useless detection packets. When the network is large and traffic is heavy, the redundant detection packets affect normal network services and can even distort the diagnosis results, so passive fault diagnosis is unsuitable for large-scale networks. In a large-scale network, active fault monitoring can be used to reduce the redundant detection packets; in addition, the validity of the detection packets must be improved so that fewer probes are needed for diagnosis.
(2) Active fault diagnosis is limited by traffic overhead in large-scale networks. It requires reasonable and effective probe stations to be placed in the network, and the position and number of probe stations directly affect the accuracy of the diagnosis results, yet existing research has not solved the problem of probe station placement and quantity. Moreover, in a large-scale network, designing probe paths that cover the network has very high time complexity, and the paths must be recomputed whenever the topology changes, which makes the approach unsuitable for dynamic network structures.
(3) In techniques based on network system logs, the decision threshold is chosen empirically. Because data below the threshold are not analyzed, some details related to the network condition may be missed. When the network state changes unpredictably, an empirical threshold cannot accurately judge the current state of the network, so the network fault management system cannot obtain the fault information. Analyzing all fault data, on the other hand, suffers from high time complexity and a large amount of redundant information.
Difficulty and significance of solving the above technical problems: in large-scale data center networks, passive fault detection falls short in real-time performance and validity, while active fault detection faces the problem of how to select probe stations and probe paths. For large and complex networks, selecting probe paths that traverse the network is NP-hard, and the paths must be recomputed after every network change, which severely limits network topology reconstruction and optimization. Finding a new, reasonable, and effective fault diagnosis model is therefore very necessary in engineering practice. In addition, the traditional approach of relying on human experience to judge whether a network node has failed is also severely limited, so building a suitable model that selects suitable thresholds for different network structures is of considerable research value.
Summary of the invention
In view of the problems of the prior art, the present invention provides a data center network fault node diagnosis method and system based on test data.
The invention is realized as follows. A data center network fault node diagnosis method based on test data generates a dynamic breadth-first spanning tree from existing fault detection information as the detection path between nodes; preliminarily determines the fault probability of each network member by analyzing the test data under a given prior probability p; and selects a reasonable threshold by analyzing the probability distribution function to identify faulty nodes, classifying the suspect node set into a faulty node set and a normal node set.
Further, in the data center network fault node diagnosis method based on test data, given a dynamic spanning tree, the test state s_i between any pair of adjacent nodes consists of two indicator variables, s = {r_ij, r_ji}, where r_ij (r_ji) is the result of node i (j) testing node j (i); r_ij = 0 means that node j is identified as normal by node i, and r_ij = 1 means that node j is identified as faulty by node i; the matrix composed of the s_i is called the symptom matrix S.
Further, the data center network fault node diagnosis method based on test data specifically comprises dynamic spanning tree search, fault probability evaluation, and fault reasoning:
the dynamic spanning tree search dynamically generates the detection tree according to the previous detection result;
the fault probability evaluation quantifies the fault probability of each suspect node;
the fault reasoning puts absolutely faulty nodes into the faulty group, selects a suitable threshold according to the fault probability table, and divides the suspect nodes into a relatively faulty group and a relatively normal group.
Further, the dynamic spanning tree search is based on a heuristic breadth-first search, where N is the set of nodes in the network, NF is the set of normal nodes, and F is the set of faulty nodes:
Step 1: F ← the last detection result;
Step 2: if F = ∅, go to Step 3; otherwise, go to Step 4;
Step 3: find the breadth-first spanning tree by the classical algorithm;
Step 4: NF ← N − F; use NF as the initial search nodes.
Further, in the fault probability evaluation, the fault probability of each node is obtained from the fault probability table, and the decision function ψ_c determines the final fault probability of each node in one detection round:
f(n_j) = max{ e_ij | e_ij ∈ E }, n_j ∈ N
This yields a unique fault probability for each node; the final faulty nodes are then determined by fault reasoning.
Further, the fault reasoning considers a node to be faulty when its fault probability is greater than 0.5.
Another object of the present invention is to provide a data center network fault node diagnosis system based on test data that implements the above method, the system comprising:
a spanning tree module, for generating a dynamic breadth-first spanning tree from existing fault detection information as the detection path between nodes;
a fault probability determining module, for preliminarily determining the fault probability of each network member by analyzing the test data under a given prior probability p;
a classification module, for selecting a reasonable threshold by analyzing the probability distribution function to identify faulty nodes, classifying the suspect node set into a faulty node set and a normal node set.
Another object of the present invention is to provide a data center network system that applies the above data center network fault node diagnosis method based on test data.
In summary, the advantages and positive effects of the present invention are as follows. Point-to-point detection avoids the problems of choosing the number of probe stations and placing them in the network; the only requirement is that a reliable data processing center exists in the network. As for detection paths, the method exploits the high node degree typical of data center networks: a breadth-first algorithm generates the search path used to obtain the test data, which reduces the computing overhead caused by network topology reconstruction and optimization. An effective probability distribution function is designed through probability calculation, and a reasonable decision threshold is selected on that basis, which avoids the influence of human experience. Experimental results show that, compared with the HFD algorithm, the HBFD algorithm performs better in terms of the number of detections and diagnostic accuracy, and it accurately identifies the faulty nodes with fewer detections in network topologies of different scales. It is a fault diagnosis technique better suited to large-scale data center networks.
Brief description of the drawings
Fig. 1 is a flowchart of the data center network fault node diagnosis method based on test data provided by an embodiment of the present invention.
Fig. 2 is an implementation flowchart of the data center network fault node diagnosis method based on test data provided by an embodiment of the present invention.
Fig. 3 is a schematic diagram of fault detection provided by an embodiment of the present invention.
Fig. 4 is a schematic diagram comparing the results of the HBFD algorithm and the HFD algorithm provided by an embodiment of the present invention.
Fig. 5 is a schematic diagram of the influence of fault granularity provided by an embodiment of the present invention.
Fig. 6 is a schematic diagram of the influence of network size provided by an embodiment of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is further described below with reference to embodiments. It should be understood that the specific embodiments described here merely illustrate the invention and do not limit it.
The rapid growth of data center networks in scale and structural complexity poses ever greater challenges to network management systems. Because network failures (node failures or link failures) are inevitable, finding methods that diagnose network faults quickly, so that network communication can be restored effectively, has become an important research topic in academia and industry. Built on a heuristic breadth-first diagnosis algorithm, the present invention monitors node failures in data center networks quickly and effectively by 1) generating a dynamic search tree, 2) analyzing the test data, and 3) selecting a reasonable threshold to determine the faulty nodes. Simulation results show that the HBFD algorithm diagnoses node failures effectively and significantly reduces the number of detections and the false alarm rate while maintaining diagnostic accuracy.
As shown in Fig. 1, the data center network fault node diagnosis method based on test data provided by an embodiment of the present invention comprises the following steps:
S101: generate a dynamic breadth-first spanning tree from existing fault detection information as the detection path between nodes;
S102: preliminarily determine the fault probability of each network member by analyzing the test data under a given prior probability p;
S103: select a reasonable threshold by analyzing the probability distribution function to identify faulty nodes, classifying the suspect node set into a faulty node set and a normal node set.
The application principle of the invention is further described below with reference to the accompanying drawings.
1. Algorithm overview
The present invention models a real-world network as an undirected graph consisting of a set of nodes N connected by a set of links L; data center networks are characterized by high connectivity. Given that current technology allows the intact units in a network to communicate with each other, the invention obtains the real-time state of nodes through mutual tests between adjacent nodes. The method first needs an algorithm for selecting fast and effective detection paths, so as to save diagnosis time and resources. Taking the high connectivity of data center networks into account, the invention therefore first generates a dynamic breadth-first spanning tree from existing fault detection information as the detection path between nodes, then preliminarily determines the fault probability of each network member by analyzing the test data under a given prior probability p, and finally selects a reasonable threshold by analyzing the probability distribution function to identify faulty nodes, classifying the suspect node set into a faulty node set and a normal node set.
HBFD consists of three main parts: dynamic spanning tree search (DSTS), fault probability evaluation (FPE), and fault reasoning (FR). The main idea of each part is briefly described below.
Definition 1: Given a dynamic spanning tree, the test state s_i (also called a symptom) between any pair of adjacent nodes consists of two indicator variables, s = {r_ij, r_ji}, where r_ij (r_ji) is the result of node i (j) testing node j (i). Here r_ij = 0 means that node j is identified as normal by node i, and r_ij = 1 means that node j is identified as faulty by node i. The matrix composed of the s_i is called the symptom matrix S.
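The symptom representation defined above can be sketched in Python as a mapping from each tree edge to its pair of test results. This is only an illustration: the function name `build_symptom_matrix`, the callback `test`, and the toy test rule in the usage lines are assumptions, not part of the patent.

```python
from typing import Callable, Dict, Iterable, Tuple

Edge = Tuple[str, str]


def build_symptom_matrix(
    tree_edges: Iterable[Edge],
    test: Callable[[str, str], int],
) -> Dict[Edge, Tuple[int, int]]:
    """Collect the symptom s = (r_ij, r_ji) for every edge (i, j) of the tree.

    `test(i, j)` is a hypothetical callback that returns r_ij:
    0 if node i reports node j as normal, 1 if it reports j as faulty.
    """
    symptoms: Dict[Edge, Tuple[int, int]] = {}
    for i, j in tree_edges:
        symptoms[(i, j)] = (test(i, j), test(j, i))
    return symptoms


# Toy usage (assumed rule): normal testers report the truth,
# and the faulty tester "c" happens to report 0 for everything.
faulty = {"c"}


def toy_test(i: str, j: str) -> int:
    return 1 if (j in faulty and i not in faulty) else 0


S = build_symptom_matrix([("a", "b"), ("b", "c")], toy_test)
```

Storing one pair per tree edge keeps the symptom matrix sparse, which matters because only adjacent nodes on the detection tree test each other.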
Dynamic spanning tree search (DSTS) dynamically generates the detection tree according to the previous detection result. HBFD can therefore avoid, as far as possible, using faulty nodes to test other suspect nodes, which effectively reduces the uncertainty of the detection results (for example, when a faulty node i tests other nodes, it may mark a faulty node as normal, so the resulting symptoms carry a high false alarm rate).
Fault probability evaluation (FPE) quantifies the fault probability of each suspect node. As shown in Fig. 3, the fault probability of a node n depends not only on the prior probability (the probability that a faulty node diagnoses correctly) but also on the diagnosis results of other nodes. FPE obtains the preliminary fault probability of each node from the fault probability table and then computes the node's final fault probability with the decision function. The higher the final fault probability, the more likely node n is to have failed.
Finally, the fault reasoning (FR) part further analyzes the fault probabilities of all suspect nodes. FR first puts the absolutely faulty nodes (those whose fault probability equals 1) into the faulty group. FR then selects a suitable threshold according to the fault probability table and divides the suspect nodes into a relatively faulty group (F) and a relatively normal group (NF).
To account for the applicability of the HBFD algorithm to data center networks of different structures, the present invention considers randomly generated connected network topologies. The invention also assumes that the network contains no malicious nodes (nodes that generate erroneous information with probability 1) and that at least one trusted management controller collects the test data and executes the HBFD algorithm.
2. Heuristic fault diagnosis
The present invention makes no assumption about any particular network topology, although specific optimizations may exist for different topologies. The invention also does not consider malicious nodes or link failures, and it regards deploying the HBFD algorithm in different data center network topologies (such as VL2, FiConn, and DCell) as feasible. Although HBFD is logically a single entity, it can still be implemented in a distributed manner by analyzing test data that are stored in a distributed fashion. The invention also assumes that at least one trusted management controller (AC) in the network collects and analyzes the test data.
2.1 case study
A fault is a network event that is the root cause of a network communication problem. A node failure occurs when a device cannot be used to route or forward traffic. Node failures may be caused by many factors, for example a device being powered down for repair or crashing because of a hardware error, or packet loss and response timeouts when network traffic is excessive; all of these failures can make the test results uncertain.
Active probing sends one or more packets to the nodes in the network to detect the real-time state of each node. According to the PMC model, a pair of mutually testing nodes can produce six different detection results: 1) when two normal nodes test each other, the result is that both nodes are working normally (for example, r_ij = 0, r_ji = 0); 2) when a normal node tests a faulty node, the result must be that the tested node is faulty (for example, r_ij = 0, r_ji = 1); 3) to 6) whatever the state of the tested node, the result reported by a faulty node is uncertain (for example, r ∈ {0, 1}). The invention defines the behavior of a faulty node as a symptom (for example, r = 1). Fig. 3 gives two simple examples that illustrate how faulty nodes make detection in the network uncertain.
Example 1: In Fig. 3-a, {a, b, c} are three nodes of the network. Assume that nodes a and c are faulty and node b is normal. The detection result of {a, b} is then one of the set {(0,1), (1,1)}, and the result of {b, c} is one of the set {(1,0), (1,1)}. When the symptom group is {(0,1), (1,0)}, node b is considered to be in good condition; when the symptom group is {(1,1), (1,1)}, node b is considered faulty; the remaining combinations are not sufficient to determine the state of node b.
Example 2: In Fig. 3-b, when node a fails, the states of all four nodes are uncertain regardless of the actual states of nodes {b, c, d}.
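The outcome sets in Example 1 can be reproduced with a short Python sketch of the PMC-style rule described above: a normal tester always reports the true state of the tested node, while a faulty tester may report either value. The function name `possible_symptoms` is an assumption made for illustration.

```python
from typing import Set, Tuple


def possible_symptoms(i_faulty: bool, j_faulty: bool) -> Set[Tuple[int, int]]:
    """Return every (r_ij, r_ji) pair allowed for one pair of adjacent nodes.

    A normal tester reports the truth (0 = normal, 1 = faulty);
    a faulty tester may report 0 or 1 regardless of the tested node.
    """
    r_ij = {0, 1} if i_faulty else {1 if j_faulty else 0}
    r_ji = {0, 1} if j_faulty else {1 if i_faulty else 0}
    return {(x, y) for x in r_ij for y in r_ji}


# Example 1: nodes a and c are faulty, node b is normal.
pair_ab = possible_symptoms(i_faulty=True, j_faulty=False)
pair_bc = possible_symptoms(i_faulty=False, j_faulty=True)
```

Here `pair_ab` evaluates to {(0, 1), (1, 1)} and `pair_bc` to {(1, 0), (1, 1)}, matching the sets given in Example 1.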
Definition 2: A faulty node obtains the correct test result with a given probability; this probability is defined as the prior probability p.
Given the prior probability p, when a faulty node n_i tests a normal node, it obtains the symptom r_ij = 0 with probability p and the symptom r_ij = 1 with probability 1 − p.
For a given network, the better the network state (for example, connectivity, delay, and packet loss rate), the higher the prior probability p and the more useful the test data. When p = 1 the detection results are entirely accurate, but this holds only in the ideal case.
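A single test under the prior probability p can be simulated as follows. This is a minimal sketch: the function name `run_test` and the use of Python's `random.Random` are assumptions, and the rule that a faulty tester is also correct with probability p when the tested node is faulty follows the general wording of the definition of p rather than an explicit statement in the text.

```python
import random


def run_test(tester_faulty: bool, target_faulty: bool, p: float,
             rng: random.Random) -> int:
    """Return r_ij for one test: 0 = target reported normal, 1 = reported faulty.

    A normal tester always reports the true state. A faulty tester reports
    the true state with probability p and the opposite with probability 1 - p.
    """
    truth = 1 if target_faulty else 0
    if not tester_faulty:
        return truth
    return truth if rng.random() < p else 1 - truth


# Empirical check: a faulty tester probing a normal node with p = 0.7
rng = random.Random(42)
samples = [run_test(True, False, 0.7, rng) for _ in range(10_000)]
share_reported_normal = samples.count(0) / len(samples)
```

With enough samples, `share_reported_normal` should be close to p, which is one way to see why a higher p makes the test data more useful.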
The search of 2.2 dynamic spanning trees
To detect node failures, the invention exploits the high connectivity of data center networks and uses the current detection information to generate a breadth-first spanning tree as the path along which nodes test each other. Even with the prior probability introduced to quantify the uncertainty of the detection results, the test results of faulty nodes remain uncertain, and the test data obtained from a single detection round are not sufficient to find all faulty nodes. A point-to-point fault diagnosis algorithm known as HFD finds most faulty nodes through repeated detection, but it does not make good use of the state information obtained after each round to reduce the uncertainty of the results, so it needs many detections and is unsuitable for large-scale data center networks.
Data center networks are characterized by high connectivity. For large-scale data center networks, using the breadth-first search (BFS) algorithm to find the spanning tree therefore speeds up the search for detection paths. BFS also effectively avoids the high false alarm rate caused by relying on a single diagnosis, as shown in Fig. 3. Based on BFS, the invention designs a heuristic breadth-first search (HBFS), shown as Algorithm 1, where N is the set of nodes in the network, NF is the set of normal nodes, and F is the set of faulty nodes.
The heuristic breadth-first search dynamically uses the information about absolutely faulty nodes (those with fault probability 1) contained in the previous detection result to find a new tree that avoids using absolutely faulty nodes to test other nodes. By reducing the uncertainty of the detection results and the number of communications, the HBFS algorithm improves both the accuracy and the speed of detection.
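The heuristic search described above can be sketched in Python as a breadth-first traversal that never expands from a node already known to be absolutely faulty. This is an illustrative sketch only: the function name `hbfs_tree`, the adjacency-dictionary input, and the choice to leave known faulty nodes out of the new tree altogether are assumptions, not the patent's Algorithm 1 verbatim.

```python
from collections import deque
from typing import Dict, FrozenSet, List, Tuple


def hbfs_tree(adjacency: Dict[str, List[str]],
              known_faulty: FrozenSet[str] = frozenset()) -> List[Tuple[str, str]]:
    """Return the (tester, tested) edges of a breadth-first spanning forest.

    With an empty `known_faulty` set this is a classical BFS tree.
    Otherwise the search starts only from nodes in NF = N - F, so no
    absolutely faulty node is ever used to test another node.
    """
    start_nodes = [n for n in adjacency if n not in known_faulty]  # NF <- N - F
    visited = set()
    edges: List[Tuple[str, str]] = []
    for root in start_nodes:
        if root in visited:
            continue
        visited.add(root)
        queue = deque([root])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor in known_faulty or neighbor in visited:
                    continue
                visited.add(neighbor)
                edges.append((node, neighbor))
                queue.append(neighbor)
    return edges


# Hypothetical four-node topology
topology = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a", "d"], "d": ["b", "c"]}
first_round = hbfs_tree(topology)
second_round = hbfs_tree(topology, frozenset({"b"}))
```

In the second call, node b was found absolutely faulty in the previous round, so node d is reached through c instead of b. Iterating over every node in NF as a possible root also covers the case where removing faulty nodes disconnects the graph.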
The assessment of 2.3 probabilities of malfunction
For a randomly generated network topology, a test result of 0 means that the tested node is in the normal state, and a result of 1 means that the tested node is faulty. The detection results and their corresponding fault probabilities are then as shown in Table 1.
Table 1:Probability of malfunction table
The fault probability of each node can be obtained from the fault probability table. Because the search path is a breadth-first spanning tree, the detection results corresponding to each node are uncertain. The invention therefore designs a decision function (DF) ψ_c on the basis of the fault probabilities to determine the final fault probability of each node in one detection round.
The DF yields a unique fault probability for each node. The final faulty nodes are then determined by fault reasoning.
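One way to read the decision function f(n_j) = max{ e_ij | e_ij ∈ E } given in the summary is that a node's final fault probability is the largest preliminary estimate produced by any test in which it is the tested node. The Python sketch below follows that reading; the function name `final_fault_probability` and the numeric values are hypothetical placeholders, since the entries of the fault probability table are not reproduced here.

```python
from typing import Dict, Tuple


def final_fault_probability(
    estimates: Dict[Tuple[str, str], float]
) -> Dict[str, float]:
    """Reduce per-test estimates e_ij to one fault probability per node.

    `estimates[(i, j)]` is assumed to be the preliminary fault probability
    of node j derived from its test with node i; the final value for j is
    the maximum over all such estimates.
    """
    final: Dict[str, float] = {}
    for (_tester, tested), e_ij in estimates.items():
        final[tested] = max(final.get(tested, 0.0), e_ij)
    return final


# Hypothetical preliminary estimates (not taken from Table 1)
e = {("a", "b"): 0.2, ("c", "b"): 0.7, ("b", "c"): 1.0}
psi = final_fault_probability(e)
```

Taking the maximum is the conservative choice: a single strongly incriminating test is enough to keep a node under suspicion in the next round.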
2.4 fault reasoning
After obtaining the final probability of malfunction of each node, first by absolute failure node (for example, ψc()=1) it is put into event
Hinder node set (F), then selects two suitable threshold values to determine other malfunctioning nodes according to probability of malfunction table.For table 1
In four kinds of detection results, have analysis as described below:
1) result of Num.1 indicates that the node detected mutually is in same state.When priori Probability p >=0.3, the two
The probability of malfunction of node is below 0.5, then it is assumed that two nodes are all in normal condition.When priori Probability p<When 0.3, failure is general
Rate is higher than 0.5.This may cause decision function that higher probability of malfunction is distributed to normal probe node, also mean that
Deceptive information is provided there are malicious node to interfere failure diagnostic process.
2) The detection results Num.2 and Num.3 show that there is an absolute faulty node among the detecting nodes (i.e., its detected fault probability is 1). Regardless of the value of the prior probability p, the fault probability of the other node is very low. The node with fault probability 1 is therefore considered an absolute faulty node, and the other node is considered to be in the normal state.
3) The result Num.4 shows a probability that is always above 0.5: at least one of the mutually tested nodes is faulty, but which one cannot be determined. Both mutually tested nodes are therefore marked as suspect nodes, and a further determination is made in the next detection.
In conclusion thinking that the node is malfunctioning node, prior probability p when the probability of malfunction of node is greater than 0.5<0.3.
Next time detect when in conjunction with upper primary detection result analysis, effectively avoid using malfunctioning node come and meanwhile detect it is multiple other
Network members reduce the uncertainty of detection result to improve speed of detection.
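The grouping step described in this section can be sketched as follows. This is a minimal illustration, not the implementation of the invention: the function name `fault_reasoning`, the dictionary of final fault probabilities, and the use of a single threshold of 0.5 (the text mentions two thresholds without giving their values) are all assumptions made for the example.

```python
from typing import Dict, Set, Tuple


def fault_reasoning(
    final_prob: Dict[str, float], threshold: float = 0.5
) -> Tuple[Set[str], Set[str], Set[str]]:
    """Group nodes by final fault probability (hypothetical helper).

    Returns (absolute_faulty, relatively_faulty, relatively_normal).
    """
    absolute_faulty: Set[str] = set()
    relatively_faulty: Set[str] = set()
    relatively_normal: Set[str] = set()
    for node, prob in final_prob.items():
        if prob >= 1.0:
            # Absolute faulty node: decision function value equal to 1.
            absolute_faulty.add(node)
        elif prob > threshold:
            # Assumed rule: above the threshold means faulty.
            relatively_faulty.add(node)
        else:
            relatively_normal.add(node)
    return absolute_faulty, relatively_faulty, relatively_normal


# Illustrative probabilities only; they are not taken from Table 1.
groups = fault_reasoning({"a": 1.0, "b": 0.7, "c": 0.2, "d": 0.5})
```

In this sketch a node whose probability equals the threshold exactly is treated as normal, because the text requires the probability to be greater than 0.5 for a faulty verdict.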
The application effect of the present invention is explained in detail below with reference to simulation.
1. As an active point-to-point fault diagnosis algorithm based on data center network dial-test data, HBFD can be flexibly deployed and integrated with existing routing protocols to improve the accuracy of fault diagnosis and reduce monitoring cost. The simulation results and the performance of the algorithm proposed by the present invention are presented below.
The evaluation indices include: (1) number of detections (DN); (2) failure granularity (FG); (3) correct diagnosis rate (CDR); (4) false alarm rate (FDR). The number of detections measures the number of communications used to obtain dial-test data between nodes during fault detection. The lower the DN value, the better the algorithm. PF denotes the real faulty nodes. Failure granularity is the ratio of the number of faulty nodes to the network size.
The correct diagnosis rate and the false alarm rate are two evaluation indices that measure the accuracy of a fault diagnosis algorithm, and are written as:
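As a hedged illustration only, one common way to compute two such rates is sketched below. The definitions used here are assumptions and are not confirmed as the formulas of the invention: CDR is taken as the share of real faulty nodes (PF) that were diagnosed as faulty, and FDR as the share of normal nodes that were wrongly diagnosed as faulty. The function and parameter names are hypothetical.

```python
from typing import Iterable, Tuple


def diagnosis_rates(
    diagnosed: Iterable[int], real_faulty: Iterable[int], all_nodes: Iterable[int]
) -> Tuple[float, float]:
    """Return (CDR, FDR) under the assumed definitions described above."""
    diagnosed_set = set(diagnosed)
    pf = set(real_faulty)
    normal = set(all_nodes) - pf
    # Assumed CDR: correctly diagnosed faulty nodes / real faulty nodes.
    cdr = len(diagnosed_set & pf) / len(pf) if pf else 1.0
    # Assumed FDR: normal nodes flagged as faulty / normal nodes.
    fdr = len(diagnosed_set & normal) / len(normal) if normal else 0.0
    return cdr, fdr


# Illustrative data: 10 nodes, 4 real faults, one missed fault, one false alarm.
cdr, fdr = diagnosis_rates({0, 1, 2, 7}, {0, 1, 2, 3}, range(10))
```

Other denominators are possible (for example, dividing false alarms by the number of diagnosed nodes), so the sketch should be adapted to whichever definition is actually intended.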
2. Simulation results
Considering the applicability of the HBFD algorithm, network topologies are generated and faulty nodes are selected at random. The simulation results are shown in Fig. 4 to Fig. 6, and the performance of the HBFD algorithm is analyzed from three aspects.
1) Advantages of HBFD
Compared with the hierarchical fault diagnosis (HFD) algorithm, the advantages of the HBFD algorithm are shown in Fig. 4. In a network topology containing 5000 nodes, the prior probability is 0.3 and the range of the detection granularity is [0.05, 0.5]. It is evident from the figure that the HBFD algorithm markedly improves the number of detections and the correct diagnosis rate. This also means that avoiding the use of faulty nodes to detect other nodes effectively improves diagnostic accuracy and effectively reduces invalid communication. There is one drawback, however: as shown in Fig. 4, HBFD may still produce a false alarm rate, albeit a very low one. The cause of the false alarm rate is analyzed below.
2) Influence of failure granularity
As shown in Fig. 5, FG is set in the range [0.1, 0.4]. The diagnostic accuracy remains very stable. However, as the failure granularity increases, the false alarm rate rises slightly. This is because a higher failure granularity indicates more faulty nodes in the network, so HBFS cannot find a dynamic spanning tree that contains no faulty nodes. As described in Table 1, the fourth symptom then occurs with higher probability. As a result, several normal nodes are identified as faulty nodes.
3) Influence of network size
Different network sizes in the range [1000, 5000] and the same failure granularity of 0.5 are used, as shown in Fig. 6. Clearly, the correct diagnosis rate is stable across networks of different sizes, and the false detection rate declines as the network size increases. These results indicate that HBFD is highly robust across networks of different sizes and that its misdiagnosis rate is low.
The point-to-point fault diagnosis algorithm based on dial-test data of the present invention detects and locates the most probable faulty nodes. Compared with the HFD algorithm, the HBFD algorithm performs better in terms of number of detections and diagnostic accuracy. It can accurately identify the faulty nodes in network topologies of different sizes with a lower number of detections. In future work, on the one hand, the root cause of the algorithm's false alarm rate will be analyzed further in an attempt to eliminate the false alarm rate of the HBFD algorithm; on the other hand, more realistic network topologies will be considered, and the HBFD algorithm will be combined with different transport protocols. Introducing new methods into HBFD in order to diagnose malicious nodes or other types of failures in the network is also of some value.
The foregoing is merely illustrative of the preferred embodiments of the present invention and is not intended to limit the invention. Any modifications, equivalent replacements, and improvements made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.
Claims (8)
1. A data center network faulty node diagnosis method based on dial-test data, characterized in that the method generates a dynamic breadth-first spanning tree from existing fault detection information as the detection path between nodes; analyzes the dial-test data based on a given prior probability p to preliminarily determine the fault probability of network members; and selects a reasonable threshold by analyzing the probability distribution function to identify faulty nodes, classifying the suspect node set into a faulty node set and a normal node set.
2. The data center network faulty node diagnosis method based on dial-test data according to claim 1, characterized in that, given a dynamic spanning tree, the test state si between any pair of adjacent nodes consists of two indicator variables s={rij,rji}, where rij (rji) is the result of node i (j) testing node j (i); rij=0 means that node j is identified as a normal node by node i, and rij=1 means that node j is identified by node i as being in a fault state; the matrix formed by the si is called the symptom matrix S.
3. The data center network faulty node diagnosis method based on dial-test data according to claim 1, characterized in that the method specifically comprises: dynamic spanning tree search, fault probability assessment, and fault reasoning;
the dynamic spanning tree search dynamically generates the detection tree according to the previous detection result;
the fault probability assessment quantifies the fault probability of each suspect node;
the fault reasoning puts absolute faulty nodes into the fault group, selects a suitable threshold according to the fault probability table, and divides the suspect nodes into a relatively faulty group and a relatively normal group.
4. The data center network faulty node diagnosis method based on dial-test data according to claim 3, characterized in that the dynamic spanning tree search is based on a heuristic breadth-first search, where N is the set of nodes in the network, NF is the set of normal nodes, and F is the set of faulty nodes;
Step 1, F ← the last detection result;
Step 2, turn to Step 3; else, turn to Step 4;
Step 3, find the breadth-first spanning tree by the classical algorithm;
Step 4, NF ← N-F, use NF as the initial search nodes.
5. The data center network faulty node diagnosis method based on dial-test data according to claim 3, characterized in that, in the fault probability assessment, the fault probability of each node is obtained from the fault probability table, and the decision function ψc is used to determine the final fault probability of each node in a single detection;
a unique fault probability is obtained for each node; the final faulty nodes are determined by fault reasoning.
6. The data center network faulty node diagnosis method based on dial-test data according to claim 3, characterized in that the fault reasoning considers a node to be a faulty node when the fault probability of the node is greater than 0.5.
7. A data center network faulty node diagnosis system based on dial-test data that implements the data center network faulty node diagnosis method based on dial-test data according to claim 1, characterized in that the system comprises:
a spanning tree module, for generating a dynamic breadth-first spanning tree from existing fault detection information as the detection path between nodes;
a fault probability determining module, for analyzing the dial-test data based on a given prior probability p to preliminarily determine the fault probability of network members;
a classification module, for selecting a reasonable threshold by analyzing the probability distribution function to identify faulty nodes, and for classifying the suspect node set into a faulty node set and a normal node set.
8. A data center network system using the data center network faulty node diagnosis method based on dial-test data according to any one of claims 1 to 6.
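The dynamic spanning tree search steps recited in claim 4 can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the claimed implementation. The condition of Step 2 is not stated in the text, so the sketch assumes it tests whether F is empty. It also assumes that nodes reported faulty in the last detection may be tested but never test others. The function name `dynamic_bfs_tree` and the adjacency-dictionary input are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple


def dynamic_bfs_tree(
    adjacency: Dict[str, Iterable[str]], last_faulty: Iterable[str]
) -> List[Tuple[str, str]]:
    """Return detection edges (tester, tested) of a breadth-first forest."""
    nodes = set(adjacency)
    faulty = set(last_faulty) & nodes      # Step 1: F <- last detection result
    normal = nodes - faulty                # Step 4: NF <- N - F
    # Assumed Step 2/3: when F is empty, NF equals N and the loop below
    # reduces to the classical breadth-first spanning tree (forest).
    visited: Set[str] = set()
    edges: List[Tuple[str, str]] = []
    for seed in sorted(normal):            # NF supplies the initial search nodes
        if seed in visited:
            continue
        visited.add(seed)
        queue = deque([seed])
        while queue:
            tester = queue.popleft()
            for neighbor in sorted(adjacency.get(tester, ())):
                if neighbor in visited:
                    continue
                visited.add(neighbor)
                edges.append((tester, neighbor))
                if neighbor not in faulty:
                    # Assumption: previously faulty nodes are not expanded,
                    # so they never act as testers in this round.
                    queue.append(neighbor)
    return edges


# Four nodes in a ring; node "b" was reported faulty in the last detection.
ring = {"a": ["b", "d"], "b": ["a", "c"], "c": ["b", "d"], "d": ["a", "c"]}
tree = dynamic_bfs_tree(ring, {"b"})
```

In this sketch a previously faulty node that is reachable only through other previously faulty nodes is left out of the forest and would have to be handled separately.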
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810603564.2A CN108933694B (en) | 2018-06-09 | 2018-06-09 | Data center network fault node diagnosis method and system based on dial testing data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810603564.2A CN108933694B (en) | 2018-06-09 | 2018-06-09 | Data center network fault node diagnosis method and system based on dial testing data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108933694A true CN108933694A (en) | 2018-12-04 |
CN108933694B CN108933694B (en) | 2021-11-09 |
Family
ID=64446368
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810603564.2A Active CN108933694B (en) | 2018-06-09 | 2018-06-09 | Data center network fault node diagnosis method and system based on dial testing data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108933694B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101715203A (en) * | 2009-11-30 | 2010-05-26 | 中国移动通信集团浙江有限公司 | Method and device for automatically positioning fault points |
CN103856789A (en) * | 2014-03-13 | 2014-06-11 | 赛特斯信息科技股份有限公司 | System and method for achieving OTT service quality guarantee based on user behavior analysis |
Non-Patent Citations (5)
Title |
---|
QI XIAOGANG等: "Fault diagnosis based on dial-test data in datacenter networks", 《JOURNAL OF SYSTEMS ENGINEERING AND ELECTRONICS》 * |
RONG ZHOU等: "Breadth-first heuristic search", 《ARTIFICIAL INTELLIGENCE》 * |
党克等: "配电网计划孤岛划分方法研究 ", 《中国电力》 * |
王伟等: "多值属性系统的故障诊断策略最优化方法 ", 《仪器仪表学报》 * |
王冰纯: "基于数据析的网络诊断算法研究", 《CNKI优秀硕士学位论文全文数据库》 * |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109802855A (en) * | 2018-12-28 | 2019-05-24 | 华为技术有限公司 | A kind of Fault Locating Method and device |
CN110247826A (en) * | 2019-07-10 | 2019-09-17 | 上海理工大学 | A kind of point to point network continuity testing method |
CN110247826B (en) * | 2019-07-10 | 2022-03-25 | 上海理工大学 | Point-to-point network connectivity test method |
CN111327491A (en) * | 2020-01-20 | 2020-06-23 | 上海市大数据中心 | Server-centered pessimistic diagnosis method for data center network |
CN113595810A (en) * | 2021-06-17 | 2021-11-02 | 国网上海能源互联网研究院有限公司 | Interactive testing method and system suitable for power distribution network information |
CN113595810B (en) * | 2021-06-17 | 2023-09-26 | 国网上海能源互联网研究院有限公司 | Interactive testing method and system suitable for power distribution network information |
CN115914009A (en) * | 2021-08-10 | 2023-04-04 | 中国移动通信集团江苏有限公司 | ToB private network service quality testing method and system |
CN115914009B (en) * | 2021-08-10 | 2024-10-22 | 中国移动通信集团江苏有限公司 | ToB private network service quality testing method and system |
CN113608072A (en) * | 2021-10-06 | 2021-11-05 | 深圳市景星天成科技有限公司 | Electric power self-healing rapid fault positioning method based on non-sound condition |
CN113608072B (en) * | 2021-10-06 | 2021-12-28 | 深圳市景星天成科技有限公司 | Electric power self-healing rapid fault positioning method based on non-sound condition |
CN114978794B (en) * | 2022-05-19 | 2023-06-23 | 北京有竹居网络技术有限公司 | Network access method, device, storage medium and electronic equipment |
CN114978794A (en) * | 2022-05-19 | 2022-08-30 | 北京有竹居网络技术有限公司 | Network access method, device, storage medium and electronic equipment |
CN117955850A (en) * | 2023-07-31 | 2024-04-30 | 非凡士智能科技(苏州)有限公司 | Method for detecting networking performance of Internet of things system and improving stability |
CN117114102A (en) * | 2023-10-13 | 2023-11-24 | 江苏前景瑞信科技发展有限公司 | Transformer fault diagnosis method based on Bayesian network and fault tree |
CN118473910A (en) * | 2024-07-08 | 2024-08-09 | 鄂尔多斯市泛胜数据技术有限公司 | Electric power Internet of things fault detection method and system based on edge cloud cooperation |
CN118473910B (en) * | 2024-07-08 | 2024-09-10 | 鄂尔多斯市泛胜数据技术有限公司 | Electric power Internet of things fault detection method and system based on edge cloud cooperation |
Also Published As
Publication number | Publication date |
---|---|
CN108933694B (en) | 2021-11-09 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||