CN108933694B - Data center network fault node diagnosis method and system based on dial testing data - Google Patents

Data center network fault node diagnosis method and system based on dial testing data

Info

Publication number
CN108933694B
Authority
CN
China
Prior art keywords
fault
node
probability
network
detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810603564.2A
Other languages
Chinese (zh)
Other versions
CN108933694A (en)
Inventor
齐小刚
王冰纯
刘立芳
冯海林
胡绍林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority to CN201810603564.2A
Publication of CN108933694A
Application granted
Publication of CN108933694B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H04L41/0636 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis based on a decision tree analysis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0677 Localisation of faults

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention belongs to the technical field of supervision, monitoring or testing devices, and discloses a data center network fault node diagnosis method and system based on dial-up test data. A dynamic breadth-first spanning tree is generated according to the existing fault detection information and used as the detection path between nodes; the dial-up test data are analyzed based on a given prior probability p to preliminarily determine the fault probability of each network member; and a reasonable threshold is selected by analyzing a probability distribution function to identify fault nodes, classifying the suspicious node set into a fault node set and a normal node set. Compared with the HFD algorithm, the HBFD algorithm performs better in both the number of detections and the diagnostic accuracy, and it can accurately identify the fault nodes in network topologies of different scales with fewer detections. Introducing new methods into HBFD to diagnose malicious nodes or other types of failures in a network is also of some research value.

Description

Data center network fault node diagnosis method and system based on dial testing data
Technical Field
The invention belongs to the technical field of supervision, monitoring or testing devices, and particularly relates to a data center network fault node diagnosis method and system based on dial-up test data.
Background
At present, the state of the art commonly used in the industry is as follows: with the advent of the big data age, the growing demand for cloud computing has driven the expansion of data center networks. Today, data center networks contain hundreds of thousands of servers connected by Network Interface Cards (NICs), switches and routers, cables and fibers, which are mostly distributed and carry high traffic. In large systems, detecting and locating faults is important so that the network management system can restore network communication through a fault recovery mechanism. Although many studies have been devoted to fault diagnosis strategies, the problems described below still need to be solved. 1) Complexity of diagnosis: in addition to the higher time and space complexity of fault localization, the increase in network size may also make fault diagnosis more complex because of dynamic, incomplete and uncertain information. Therefore, it is significant to effectively reduce the number of detections and improve the detection efficiency of fault diagnosis. 2) Increased network load: data centers may significantly shorten algorithm execution times, but may also increase the likelihood of controller overhead. One possible solution is to apply a participation strategy to control the number of monitors, while another is to increase the effectiveness of the probe data so as to reduce the amount of data. Existing network fault diagnosis technologies fall into three main categories: passive fault detection, active fault detection, and fault pattern identification based on network logs. The passive fault detection method monitors the real-time performance of the network by deploying monitoring agents in the network, and passively acquires the state information of network members.
One method places passive monitoring devices on specific links in the network, where the agents monitor the current status of the network members by exploiting all links in the network at a given time; its disadvantage is that it generates redundant monitoring agents in large-scale networks. A passive fault diagnosis method using a dependency graph can only detect a limited number of fault nodes at one time and is not suitable for large-scale networks. Bayesian Belief Networks (BBNs) are also widely used in fault detection. A BBN models the network structure as a directed acyclic graph and then tries to find failed nodes by analyzing end-to-end observable symptoms; however, fault reasoning has high time complexity in a large-scale network, so the fault management system cannot recover network communication in a timely and effective manner. Active fault diagnosis typically uses probes to detect the condition of servers; the selected probes are transmitted to obtain end-to-end statistics such as packet loss rate, delay and throughput. The controller then collects these statistics to obtain further information about the network. For active detection reasoning, it is very important to design a suitable and effective detection strategy. One existing work proposes a fault diagnosis system architecture using adaptive probing. Most probe-based technologies contain three components: probe station selection, probe selection and fault reasoning. However, these approaches are limited by traffic overhead in large-scale networks. A phased test method for reducing network traffic overhead uses only a small set of probes to detect a small area of the network at each stage. However, how to place the probe stations reasonably and how to handle failures of the probe stations themselves remain to be discussed. A probe station selection algorithm has also been proposed to minimize the number of probe stations and make the probe stations robust against failure.
However, the problem of how to place a probe station to monitor for a faulty station remains unsolved. With the development of big data technology, fault diagnosis based on log data has attracted considerable attention. Network system log based techniques are typically built on a threshold algorithm that first sets appropriate thresholds for different detection capabilities of the network based on the experience of the network administrator, and then detects faults by comparing actual values with the preset thresholds. This technique is simple but has two distinct disadvantages: 1) its threshold is chosen empirically; 2) data below the threshold is not analyzed, so some detailed information about the network condition may be missed. A novel analysis system for active fault diagnosis not only considers keywords of abnormal logs such as errors and decline, but also tries to find sudden fault patterns. However, data-based algorithms have high time complexity in data preprocessing (such as data extraction, data cleanup, and exception handling).
In summary, the problems of the prior art are as follows:
(1) The passive fault detection method generates redundant monitoring agents in a large-scale network, so that many useless detection packets exist in the network. When the network is large and busy, the redundant detection packets can affect the normal service of the network and even the fault diagnosis result, so passive fault diagnosis is not suitable for large-scale networks. Therefore, in a large-scale network, on the one hand, an active fault monitoring technology can be adopted to effectively reduce the redundant detection packets in the network. On the other hand, it is necessary to improve the effectiveness of the probe packets and reduce the number of probes used for fault diagnosis and detection.
(2) Active fault diagnosis is limited by traffic overhead in a large-scale network. Reasonable and effective detection base stations need to be placed in the network, and their position and number directly influence the accuracy of the fault diagnosis result, but existing research has not solved the problems of where to place the detection base stations and how many to use. In a large-scale network, designing a probe path that covers the network has high time complexity, and recalculation is needed whenever the network topology changes, so the method is not suitable for a dynamic network structure.
(3) The decision threshold is selected empirically in network log based techniques. On the one hand, since the technique does not analyze data below the threshold, some detailed information about the network condition may be missed; when the network state changes abruptly and unpredictably, the empirical threshold cannot accurately judge the current state of the network, so the network fault management system cannot acquire the fault information in the network. On the other hand, analyzing all fault data suffers from high time complexity and a large amount of redundant information.
The difficulty and significance of solving the technical problems are as follows: in a large-scale data center network, passive fault detection is insufficient in real-time performance and effectiveness, and active fault detection faces the problem of how to select the detection base stations and the detection paths. For a large-scale complex network, sending probes that traverse the network paths exactly once is an NP-hard problem, and recalculation is needed after each network change, which greatly limits network topology reconstruction, optimization and the like; a new, reasonable and effective fault diagnosis approach is therefore very necessary in engineering. In addition, when judging whether a network node has failed, the traditional method relying on manual experience is also very limited, so establishing a proper model and selecting a proper threshold for different network structures is also very significant in research.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a data center network fault node diagnosis method and system based on dial testing data.
The data center network fault node diagnosis method based on the dial-up test data generates a dynamic breadth-first spanning tree as a detection path between nodes according to the existing fault detection information; analyzing the dial-up test data based on a given prior probability p to preliminarily determine the fault probability of the network member; and selecting a reasonable threshold value through analyzing a probability distribution function to identify the fault node, and classifying the suspicious node set into a fault node set and a normal node set.
Furthermore, in the data center network fault node diagnosis method based on dial test data, given a dynamic spanning tree, the test state $s_i$ between any pair of adjacent nodes consists of two indicator variables, $s_i = \{r_{ij}, r_{ji}\}$, where $r_{ij}$ ($r_{ji}$) is the result of node i (j) testing node j (i); $r_{ij} = 0$ means that node i identifies node j as a normal node, and $r_{ij} = 1$ means that node j is identified by node i as being in a fault state; the matrix composed of the $s_i$ is called the symptom matrix S.
Further, the data center network fault node diagnosis method based on dial testing data specifically comprises the following steps: dynamic spanning tree, failure probability evaluation and failure reasoning;
dynamically generating a detection tree according to the last detection result by the dynamic spanning tree search;
evaluating and quantifying the fault probability of each suspicious node by the fault probability;
fault reasoning puts absolute fault nodes into a fault group; and selecting a proper threshold according to the fault probability table, and dividing the suspicious nodes into a relatively fault group and a relatively normal group.
Further, the dynamic spanning tree search is based on a heuristic breadth-first search algorithm, wherein N is a group of nodes in the network, NF is a normal node set, and F is a fault node set;
step one, F ← the last detection result;
step two, if
Figure GDA0003244876040000041
then turn to step three; otherwise, turn to step four;
step three, generate the breadth-first spanning tree by the conventional algorithm;
step four, NF ← N − F, and use NF as the initial search nodes.
Further, the fault probability of each node is evaluated according to the fault probability table, and a decision function $\psi_c$ is adopted to determine the final fault probability of each node in one detection;
Figure GDA0003244876040000042
obtaining the unique probability of each node failure; and determining the final fault node through fault reasoning.
Further, the fault reasoning considers that the node is a fault node when the fault probability of the node is greater than 0.5.
Another object of the present invention is to provide a data center network fault node diagnosis system based on dial test data for implementing the data center network fault node diagnosis method based on dial test data, where the data center network fault node diagnosis system based on dial test data includes:
the priority spanning tree module is used for generating a dynamic breadth priority spanning tree as a detection path between nodes according to the existing fault detection information;
the failure probability determination module is used for analyzing the dial-up test data based on the given prior probability p to preliminarily determine the failure probability of the network members;
and the classification module is used for selecting a reasonable threshold value through analyzing a probability distribution function to identify the fault node and classifying the suspicious node set into a fault node set and a normal node set.
The invention also aims to provide a data center network system applying the data center network fault node diagnosis method based on dial testing data.
In summary, the advantages and positive effects of the invention are as follows: by adopting a point-to-point detection technology, the problems of selecting the number of detection base stations and placing them in the network are solved, and only a reliable data processing center needs to exist in the network. For the detection path, the invention exploits the larger node degree found in data center networks: the breadth-first search path is generated by the breadth-first algorithm to obtain the dial-up test data, which reduces the calculation cost caused by network topology reconstruction and optimization. An effective probability distribution function is designed by means of probability calculation, and a reasonable decision threshold is selected on the basis of that function, which effectively avoids the influence of human experience. Experimental results show that, compared with the HFD algorithm, the HBFD algorithm performs better in both the number of detections and the diagnostic accuracy. The fault nodes in the network can be accurately identified with fewer detections in network topologies of different scales. The method is therefore a fault diagnosis technology better suited to large-scale data center networks.
Drawings
Fig. 1 is a flowchart of a data center network fault node diagnosis method based on dial-up test data according to an embodiment of the present invention.
Fig. 2 is a flowchart of an implementation of a data center network fault node diagnosis method based on dial-up test data according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of fault detection provided by the embodiment of the present invention.
Fig. 4 is a diagram illustrating the comparison result between the HBFD algorithm and the HFD algorithm according to the embodiment of the present invention.
Fig. 5 is a schematic diagram of the impact of the fault granularity provided by the embodiment of the present invention.
Fig. 6 is a schematic diagram of the network size influence provided by the embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The rapid growth in size and structural complexity of data center networks has made network management systems increasingly challenging. Since network failures (node failures or link failures) are inevitable, finding a method for rapidly diagnosing network failures so as to be able to effectively recover network communication functions has become an important research topic in academia and industry. The method can quickly and effectively monitor the node faults of the data center network, and determine the fault nodes by 1) generating a dynamic search tree, 2) analyzing dial-up test data and 3) selecting a reasonable threshold value on the basis of a heuristic breadth-first diagnosis algorithm. Simulation results show that the HBFD algorithm can effectively diagnose node faults and effectively reduce detection times and false alarm rate under the condition of ensuring diagnosis accuracy.
As shown in fig. 1, the data center network fault node diagnosis method based on dial-up test data according to the embodiment of the present invention includes the following steps:
S101: generating a dynamic breadth-first spanning tree as a detection path between nodes according to the existing fault detection information;
S102: analyzing the dial-up test data based on a given prior probability p to preliminarily determine the fault probability of the network member;
S103: selecting a reasonable threshold value through analyzing a probability distribution function to identify the fault node, and classifying the suspicious node set into a fault node set and a normal node set.
The application of the principles of the present invention will now be described in further detail with reference to the accompanying drawings.
1. Algorithm overview
The invention reflects the network of the real world through the undirected graph, and consists of a group of nodes N, wherein the nodes are connected by a group of links L, and the data center network has the characteristic of high connectivity. The current state of the art is considered to fully allow elements in a network to communicate with each other; the invention obtains the real-time state information of the nodes through mutual testing between adjacent nodes. The detection method firstly designs an algorithm for selecting a quick and effective detection path so as to achieve the purposes of saving diagnosis time and resource consumption. Therefore, in consideration of the characteristic of high connectivity of a data center network, the invention firstly generates a dynamic breadth-first spanning tree as a detection path between nodes according to the existing fault detection information, then analyzes dial-test data based on a given prior probability p to preliminarily determine the fault probability of network members, finally selects a reasonable threshold value through an analysis probability distribution function to identify fault nodes, and classifies a suspicious node set into a fault node set and a normal node set.
HBFD mainly consists of three parts: Dynamic Spanning Tree Search (DSTS), Failure Probability Evaluation (FPE) and Failure Reasoning (FR). The main idea of each process is briefly described below.
Definition 1: given a dynamic spanning tree, the test state $s_i$ (also called a symptom) between any pair of adjacent nodes consists of two indicator variables, $s_i = \{r_{ij}, r_{ji}\}$, where $r_{ij}$ ($r_{ji}$) is the result of node i (j) testing node j (i). Here, $r_{ij} = 0$ means that node i identifies node j as a normal node, and $r_{ij} = 1$ means that node j is recognized by node i as being in a failure state. The matrix composed of the $s_i$ is called the symptom matrix S.
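Definition 1 can be illustrated with a minimal Python sketch. The names `build_symptom_matrix` and `ideal_test`, the edge-keyed dictionary layout and the three-node example are assumptions made for illustration only; the patent does not prescribe a data structure for S.

```python
from typing import Callable, Dict, Iterable, Tuple

Edge = Tuple[int, int]
Symptom = Tuple[int, int]  # (r_ij, r_ji)


def build_symptom_matrix(
    tree_edges: Iterable[Edge],
    test: Callable[[int, int], int],
) -> Dict[Edge, Symptom]:
    """Collect the mutual test results for every pair of adjacent tree nodes.

    test(i, j) is a hypothetical callback returning r_ij: 0 if node i
    reports node j as normal, 1 if node i reports node j as faulty.
    """
    symptoms: Dict[Edge, Symptom] = {}
    for i, j in tree_edges:
        symptoms[(i, j)] = (test(i, j), test(j, i))
    return symptoms


# Illustrative assumption: node 2 is faulty and every tester reports
# the true state of its neighbour (the ideal case, p = 1).
faulty = {2}


def ideal_test(tester: int, testee: int) -> int:
    return 1 if testee in faulty else 0


S = build_symptom_matrix([(0, 1), (1, 2)], ideal_test)
```

Here each entry of `S` is one symptom $s_i$; a dense matrix indexed by node pairs would serve equally well.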
Dynamic Spanning Tree Search (DSTS) dynamically generates a probe tree based on the last probe result. Therefore, the HBFD can avoid detecting other suspicious nodes by using the failed node as much as possible, and effectively avoid uncertainty of the detection result (for example, when the failed node i detects other nodes, there is a possibility that the failed node is marked as a normal state in the symptom result, so that the detection result has a large false alarm rate).
Failure Probability Evaluation (FPE) is used to quantify the failure probability of each suspect node. As shown in fig. 3, the failure probability of the node n depends not only on its prior probability (probability of correct diagnosis of the failed node), but also is related to the diagnosis results of other nodes. And the FPE obtains the initial detection fault probability of each node according to the fault probability table, and then calculates the final fault probability of the node through a decision function. The higher the probability of ultimate failure, the more likely node n is to fail.
Finally, the failure probability of all suspicious nodes is further analyzed by a Failure Reasoning (FR) part. First, the FR puts the absolute failed node (whose failure probability equals 1) into the failed group. Then, the FR selects an appropriate threshold value according to the failure probability table, and divides the suspicious nodes into a relatively failed group (F) and a relatively normal group (NF).
In view of the applicability of the HBFD algorithm to data center networks of different architectures, the present invention considers randomly generated connected network topologies. The present invention also assumes that no malicious nodes (nodes that generate error messages with probability 1) exist in the network and that at least one trusted management controller is used to collect dial-up test data and execute the HBFD algorithm.
2. Heuristic fault diagnosis
The present invention makes no assumption of any particular network topology, although there may be particular optimizations for different network topologies. The present invention also does not take into account malicious nodes or link failures, and considers it feasible to deploy the HBFD algorithm in different data center network topologies (e.g., VL2, FiConn and DCell). Although HBFD is logically a single entity, it can still be implemented in a distributed manner by analyzing dial-up test data that is stored in a distributed fashion. The invention also assumes that at least one trusted management controller (AC) in the network is used to collect and analyze dial-up test data.
2.1 problem analysis
Failures are network events that are a source of network communication problems. A node failure occurs when a device is not available to route or forward traffic. Node failures may be caused by many factors: for example, a device may be powered off for repair or may crash due to hardware errors, and packets may be dropped or responses may time out when network traffic is too heavy, all of which may cause uncertainty in the detection result.
Active probing sends one or more data packets to nodes in the network to detect the real-time status of each node. According to the PMC model definition, a pair of mutually testing nodes will produce 6 different probing results: 1) when two normal nodes test each other, the result is that both nodes are in a normal operating state (i.e., $r_{ij} = 0$, $r_{ji} = 0$); 2) when a normal node tests a failed node, the result must be that the tested node is a failed node (e.g., $r_{ij} = 0$, $r_{ji} = 1$); 3)-6) the probing result of a failed node is indeterminate regardless of the state of the node under test (i.e., $r \in \{0, 1\}$). The present invention defines the behavior of a failed node as a symptom (e.g., $r = 1$). Fig. 3 gives two simple examples to illustrate the cause of probing uncertainty caused by a failed node in the network.
Example 1: in fig. 3-a, { a, b, c } are three nodes of the network, assuming that nodes a and c are two failed nodes and node b is a normal node. Then the result of the detection of { a, b } is any one of the sets { (0,1), (1,1) }, and the result of { b, c } is one of the sets { (1,0), (1,1) }. When the symptom group is { (0,1), (1,0) }, the node b is considered to be in a good state; however, when the symptom groups are { (1,1), (1,1) }, node b is considered to be the failed node, and the remaining combinations are not sufficient to determine the condition of node b.
Example 2: in fig. 3-b, when node a fails, the states of the four nodes are indeterminate regardless of the state of nodes { b, c, d }.
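The uncertainty in Example 1 can be checked by enumerating every symptom pair that is consistent with a given assignment of true node states. The sketch below is illustrative only: the function names are hypothetical, and it encodes just the rule stated above that a normal tester reports the true state while a failed tester may report either value.

```python
from itertools import product
from typing import Set, Tuple


def possible_results(tester_faulty: bool, testee_faulty: bool) -> Set[int]:
    """Values r that a tester may report about a testee."""
    if tester_faulty:
        return {0, 1}  # a failed tester's report is indeterminate
    return {1} if testee_faulty else {0}  # a normal tester reports the truth


def possible_symptoms(i_faulty: bool, j_faulty: bool) -> Set[Tuple[int, int]]:
    """All (r_ij, r_ji) pairs consistent with the true states of i and j."""
    return set(
        product(
            possible_results(i_faulty, j_faulty),
            possible_results(j_faulty, i_faulty),
        )
    )


# Example 1: a and c are failed, b is normal.
ab = possible_symptoms(True, False)   # pair {a, b}
bc = possible_symptoms(False, True)   # pair {b, c}
both_normal = possible_symptoms(False, False)
```

With these states, `ab` and `bc` reproduce the two result sets quoted in Example 1, which is why a single round of symptoms cannot always settle the state of node b.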
Definition 2: a failed node obtains a correct detection result with a given probability, and this probability is defined as the prior probability p.
Given a prior probability p, when a failed node $n_i$ tests a normal node, it obtains $r_{ij} = 0$ with probability p and obtains the symptom $r_{ij} = 1$ with probability 1 − p.
In the case of a given network, when the network conditions (e.g., connectivity, time delay, and packet loss rate) are relatively good, the prior probability p is high, and thus the dial-up test data will be more efficient. The detection result is completely accurate when p is 1, but it exists only in the ideal case.
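The role of the prior probability p can be sketched as a small simulation. This is an illustrative model, not the patent's procedure: `simulate_test` is a hypothetical helper, and applying the same probability p when a failed node tests another failed node is an assumption that extends Definition 2.

```python
import random


def simulate_test(tester_faulty: bool, testee_faulty: bool,
                  p: float, rng: random.Random) -> int:
    """Return r for one test: 0 = testee reported normal, 1 = reported faulty.

    A normal tester always reports the true state. A failed tester reports
    the true state with probability p and the opposite with probability 1 - p.
    """
    truth = 1 if testee_faulty else 0
    if not tester_faulty:
        return truth
    return truth if rng.random() < p else 1 - truth


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 10_000
# A failed node tests a normal node with p = 0.8: about 20% of the
# reports should wrongly say "faulty" (r = 1).
wrong = sum(simulate_test(True, False, 0.8, rng) for _ in range(n))
error_rate = wrong / n
```

Setting p = 1 removes the randomness entirely, which corresponds to the ideal case mentioned above.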
2.2 dynamic spanning Tree searching
In order to detect node faults, the characteristic of high connectivity of a data center network is combined, and the breadth-first spanning tree is generated by combining the current detection information to be used as a path for mutual detection of nodes. Even if the prior probability is introduced to quantify the uncertainty of the detection result, the detection result of the fault node is still uncertain, and the dial test data obtained by one detection is not enough to find all fault nodes. A point-to-point fault diagnosis algorithm called HFD discovers most fault nodes through multiple detections, and the algorithm does not reasonably utilize state information obtained after each detection to reduce uncertainty of each detection result, so that the detection times are large, and the algorithm is not suitable for large-scale data center networks.
The data center network has the characteristic of high connectivity. Therefore, for a large-scale data center network, the search of the spanning tree by using a breadth-first search (BFS) algorithm can effectively improve the speed of searching the detection path. BFS may also effectively avoid situations where a single diagnosis causes a high false alarm rate, as shown in fig. 3. Based on BFS, the invention designs a heuristic breadth first search algorithm (HBFS) as shown in algorithm 1, wherein N is a group of nodes in the network, NF is a normal node set, and F is a fault node set.
Figure GDA0003244876040000091
The heuristic breadth-first search algorithm dynamically uses the information of the absolute fault node (fault probability is 1) contained in the previous detection result to find a new spanning tree which avoids detecting other nodes by using the absolute fault node. By reducing the uncertainty of the detection result and the communication frequency, the HBFS algorithm effectively improves the detection accuracy and the detection speed.
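The idea of avoiding absolute fault nodes as testers can be sketched as follows. Algorithm 1 itself is only available as a figure, so this is a hedged reconstruction rather than the patented procedure: the function name, the adjacency-dictionary input and the choice to keep known fault nodes as leaves of the tree are all assumptions.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple


def heuristic_bfs_tree(adj: Dict[int, Iterable[int]],
                       faulty: Set[int]) -> List[Tuple[int, int]]:
    """Build probe-tree edges (tester, testee) by breadth-first search.

    adj maps each node to its neighbours; faulty is the set F of absolute
    fault nodes from the previous detection. If F is empty this is an
    ordinary BFS; otherwise the search starts from NF = N - F and never
    expands a node in F, so known fault nodes do not test other nodes.
    """
    nodes = set(adj)
    roots = sorted(nodes - faulty) if faulty else sorted(nodes)
    visited: Set[int] = set()
    edges: List[Tuple[int, int]] = []
    for root in roots:
        if root in visited:
            continue
        visited.add(root)
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for v in sorted(adj[u]):
                if v in visited:
                    continue
                visited.add(v)
                edges.append((u, v))
                if v not in faulty:
                    queue.append(v)  # fault nodes remain leaves
    return edges


# Hypothetical 5-node topology; node 1 was an absolute fault node last round.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
tree = heuristic_bfs_tree(adj, {1})
plain_tree = heuristic_bfs_tree(adj, set())
```

In this sketch a fault node whose neighbours are all in F is never reached; a fuller implementation would need a policy for that case.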
2.3 Fault probability assessment
For a randomly generated network topology, a dial-up test result of 0 is defined as the tested node being in a normal state, and a result of 1 indicates that the tested node is a fault node. The detection results and the corresponding failure probabilities shown in Table 1 are then obtained.
Table 1: fault probability table
Figure GDA0003244876040000101
The failure probability of each node may be obtained from the failure probability table. On the basis of the breadth-first spanning tree search path, the probe result corresponding to each node is uncertain. The invention therefore designs the Decision Function (DF) $\psi_c$ on the basis of the failure probability to determine the final failure probability of each node in a probe.
Figure GDA0003244876040000102
The unique probability of failure of each node can be obtained by DF. And finally, determining the final fault node through fault reasoning.
2.4 Fault reasoning
Having obtained the final failure probability for each node, the absolute failed nodes (i.e., $\psi_c(\cdot) = 1$) are first put into the failed node set (F), and then two appropriate thresholds are selected according to the failure probability table to determine the other failed nodes. For the four detection results in Table 1, the following analysis is performed:
1) The result of Num.1 indicates that the two mutually probing nodes are in the same state. When the prior probability p ≥ 0.3, the failure probability of both nodes is lower than 0.5, and the two nodes are considered to be in a normal state. When the prior probability p < 0.3, the failure probability is higher than 0.5. In that case the decision function may assign a high failure probability to normal probing nodes, which means that there are malicious nodes providing false information to interfere with the fault diagnosis process.
2) The results of Num.2 and Num.3 indicate that one of the two mutually probing nodes is an absolute fault node (i.e., its detected failure probability is 1), while the failure probability of the other node is very low regardless of the prior probability p. The node with failure probability 1 is therefore considered an absolute fault node, and the other node is considered to be in a normal state.
3) The result of Num.4 shows that the failure probability is always higher than 0.5, but it cannot be determined whether at least one of the mutually probing nodes is faulty, so both nodes are marked as suspicious nodes and examined further in the next detection.
In summary, a node is considered a fault node when its failure probability is greater than 0.5, with the prior probability p taken to be less than 0.3. The next detection takes the analysis of the last detection result into account, which avoids using a fault node to probe several other network members at the same time and reduces the uncertainty of the detection results, thereby improving detection speed.
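The grouping described in this section can be sketched in Python as follows. The sketch assumes the final failure probabilities are already available as a dictionary; the function name `classify_nodes` and the sample values are hypothetical, and only the 0.5 threshold and the "probability 1 means absolute fault" rule come from the text.

```python
def classify_nodes(fault_probability, threshold=0.5):
    """Split nodes into absolute-fault, relative-fault and relative-normal groups."""
    absolute, relative_fault, relative_normal = set(), set(), set()
    for node, prob in fault_probability.items():
        if prob >= 1.0:
            absolute.add(node)          # absolute fault node, goes into F
        elif prob > threshold:
            relative_fault.add(node)    # above the 0.5 threshold
        else:
            relative_normal.add(node)   # at or below the threshold
    return absolute, relative_fault, relative_normal


# Hypothetical final failure probabilities for four nodes.
probabilities = {"a": 1.0, "b": 0.72, "c": 0.5, "d": 0.08}
absolute, relative_fault, relative_normal = classify_nodes(probabilities)
```

Here node "c" sits exactly on the threshold and is treated as normal, since the text requires a probability strictly greater than 0.5.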
The application effect of the present invention will be described in detail with reference to the simulation.
1. As an active point-to-point fault diagnosis algorithm based on data center network dial-test data, HBFD can be flexibly deployed and integrated with existing routing protocols to improve the accuracy of fault diagnosis and reduce the monitoring cost. This part presents the simulation results and the performance of the algorithm proposed by the present invention.
The evaluation indexes are: (1) number of probes (DN); (2) fault granularity (FG); (3) correct diagnosis rate (CDR); (4) false alarm rate (FDR). The number of probes measures how many communications are needed to acquire dial-test data between nodes during fault detection; the lower the DN value, the better the algorithm. PF denotes the set of truly failed nodes. The fault granularity is the ratio of the number of truly failed nodes to the network size.
Figure GDA0003244876040000111
The correct diagnosis rate and the false alarm rate are the two evaluation indexes that measure the accuracy of the fault diagnosis algorithm, and are written as follows:
Figure GDA0003244876040000112
Figure GDA0003244876040000113
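Because the formulas for FG, CDR and FDR survive only as images, the Python sketch below uses assumed definitions rather than the patent's own: FG is taken as the fraction of truly failed nodes in the network, CDR as the fraction of truly failed nodes that are diagnosed as faulty, and FDR as the fraction of diagnosed nodes that are in fact normal. The function name `evaluate` and the sample sets are hypothetical.

```python
def evaluate(network_nodes, true_faulty, diagnosed_faulty):
    """Compute (FG, CDR, FDR) under the assumed definitions stated above."""
    network_nodes = set(network_nodes)
    true_faulty = set(true_faulty)            # PF in the text
    diagnosed_faulty = set(diagnosed_faulty)  # F in the text

    fg = len(true_faulty) / len(network_nodes)
    # Assumed: share of real faults that the diagnosis found.
    cdr = (len(diagnosed_faulty & true_faulty) / len(true_faulty)
           if true_faulty else 1.0)
    # Assumed: share of diagnosed nodes that are actually normal.
    fdr = (len(diagnosed_faulty - true_faulty) / len(diagnosed_faulty)
           if diagnosed_faulty else 0.0)
    return fg, cdr, fdr


# Hypothetical 10-node network with two real faults and one false alarm.
fg, cdr, fdr = evaluate(range(1, 11), true_faulty={2, 5}, diagnosed_faulty={2, 5, 7})
```

With other denominators (for example, false alarms relative to all normal nodes) the FDR value would differ, so the numbers from this sketch should not be compared directly with figs. 4-6.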
2. Simulation results
To assess the applicability of the HBFD algorithm, a network topology is generated and the failed nodes are selected at random. The performance of the HBFD algorithm is analyzed from three aspects, with the simulation results shown in figs. 4-6.
1) Advantages of HBFD
The advantages of the HBFD algorithm compared with the Hierarchical Fault Diagnosis (HFD) algorithm are shown in fig. 4. The network topology contains 5000 nodes, the prior probability is 0.3, and the detection granularity ranges from 0.05 to 0.5. The figure shows that the HBFD algorithm clearly improves both the number of probes and the correct diagnosis rate. This also means that avoiding the use of failed nodes to probe other nodes improves diagnostic accuracy and reduces invalid communications. As a disadvantage, however, HBFD incurs a false alarm rate, albeit a very low one, as shown in fig. 4; its cause is analyzed below.
2) Effect of Fault granularity
As shown in fig. 5, FG is set in the range [0.1, 0.4]. The diagnostic accuracy remains very stable, but the false alarm rate increases slightly as the fault granularity increases. This is because a higher fault granularity means more failed nodes in the network, so HBFS cannot find a dynamic spanning tree without failed nodes. As described in table 1, the fourth symptom then appears with a higher probability, and several normal nodes are therefore determined to be failed nodes.
3) Influence of network size
Networks of different sizes in the range [1000, 5000] are considered with the same fault granularity of 0.5, as shown in fig. 6. The correct diagnosis rate is stable across networks of different sizes, and the false detection rate decreases as the network size increases. These results indicate that HBFD is robust in networks of different sizes and has a low false diagnosis rate.
The point-to-point fault diagnosis algorithm based on dial-test data detects and locates the most probable fault nodes. Compared with the HFD algorithm, the HBFD algorithm performs better in both the number of probes and diagnostic accuracy, and it accurately identifies the fault nodes in network topologies of different scales with a low number of probes. In subsequent work, on the one hand, the root cause of the algorithm's false alarm rate will be analyzed further in an attempt to eliminate it; on the other hand, the HBFD algorithm will be combined with different transmission protocols and evaluated on more realistic network topologies. It would also be valuable to introduce new methods into HBFD in order to diagnose malicious nodes or other types of failures in the network.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (2)

1. A data center network fault node diagnosis method based on dial-up test data is characterized in that the data center network fault node diagnosis method based on dial-up test data generates a dynamic breadth-first spanning tree as a detection path between nodes according to the existing fault detection information; analyzing the dial-up test data based on a given prior probability p to preliminarily determine the fault probability of the network member; selecting a reasonable threshold value through analyzing a probability distribution function to identify a fault node, and classifying a suspicious node set into a fault node set and a normal node set; the fault node acquires a correct detection result according to a given probability, and the probability is defined as a prior probability p;
when the fault probability of the node is greater than 0.5, the node is considered as a fault node, and the prior probability p is less than 0.3;
the data center network fault node diagnosis method based on dial test data is given a dynamic spanning tree, and the test state s_i between any pair of adjacent nodes is composed of two indicator variables, s_i = {r_ij, r_ji}, wherein r_ij is the result of node i testing node j; r_ji is the result of node j testing node i; r_ij = 0 means that node i identifies node j as a normal node, and r_ij = 1 means that node j is identified by node i as being in a fault state; the matrix composed of the s_i is called the symptom matrix S;
the data center network fault node diagnosis method based on dial testing data specifically comprises the following steps: dynamic spanning tree, failure probability evaluation and failure reasoning;
the dynamic spanning tree search dynamically generates a detection tree according to the last detection result;
the fault probability evaluation quantifies the fault probability of each suspicious node;
fault reasoning puts absolute fault nodes into a fault group; selecting a proper threshold value according to the fault probability table, and dividing suspicious nodes into a relative fault group and a relative normal group;
the dynamic spanning tree search is based on a heuristic breadth-first search algorithm, wherein N is a group of nodes in a network, NF is a normal node set, and F is a fault node set;
step one: obtaining the absolute fault nodes of the last detection result and assigning them to F, then turning to step two;
step two: if F is not empty, turning to step three, otherwise turning to step four;
step three: obtaining a breadth-first tree in combination with the network topology, starting one detection, determining the fault probability of the network nodes in the current detection in combination with the fault probability table and the decision function, determining the absolute fault nodes, and turning to step four;
step four: if an absolute fault node exists, turning to step one; if no absolute fault node exists, finishing the detection and outputting all nodes with a fault probability greater than 0.5 as fault nodes;
wherein, in the fault probability evaluation, the fault probability of each node is obtained according to the fault probability table, and a decision function ψ_c is adopted to determine the final fault probability of each node in one detection;
f(n_j) = max{e_ij | e_ij ∈ E}, (n_j ∈ N)
Figure FDA0003249258320000021
obtaining the unique probability of each node failure; and determining the last fault node through fault reasoning.
2. A data center network fault node diagnosis system based on dial-up test data for implementing the data center network fault node diagnosis method based on dial-up test data according to claim 1, wherein the data center network fault node diagnosis system based on dial-up test data comprises:
the priority spanning tree module is used for generating a dynamic breadth priority spanning tree as a detection path between nodes according to the existing fault detection information;
the failure probability determination module is used for analyzing the dial-up test data based on the given prior probability p to preliminarily determine the failure probability of the network members; the fault node acquires a correct detection result according to a given probability, and the probability is defined as a prior probability p;
the classification module is used for selecting a reasonable threshold value through analyzing a probability distribution function to identify a fault node and classifying a suspicious node set into a fault node set and a normal node set; and when the failure probability of the node is more than 0.5, the node is considered as a failure node, and the prior probability p is less than 0.3.
CN201810603564.2A 2018-06-09 2018-06-09 Data center network fault node diagnosis method and system based on dial testing data Active CN108933694B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810603564.2A CN108933694B (en) 2018-06-09 2018-06-09 Data center network fault node diagnosis method and system based on dial testing data

Publications (2)

Publication Number Publication Date
CN108933694A CN108933694A (en) 2018-12-04
CN108933694B true CN108933694B (en) 2021-11-09

Family

ID=64446368

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810603564.2A Active CN108933694B (en) 2018-06-09 2018-06-09 Data center network fault node diagnosis method and system based on dial testing data

Country Status (1)

Country Link
CN (1) CN108933694B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109802855B (en) * 2018-12-28 2020-08-07 华为技术有限公司 Fault positioning method and device
CN110247826B (en) * 2019-07-10 2022-03-25 上海理工大学 Point-to-point network connectivity test method
CN111327491A (en) * 2020-01-20 2020-06-23 上海市大数据中心 Server-centered pessimistic diagnosis method for data center network
CN113595810B (en) * 2021-06-17 2023-09-26 国网上海能源互联网研究院有限公司 Interactive testing method and system suitable for power distribution network information
CN115914009A (en) * 2021-08-10 2023-04-04 中国移动通信集团江苏有限公司 ToB private network service quality testing method and system
CN113608072B (en) * 2021-10-06 2021-12-28 深圳市景星天成科技有限公司 Electric power self-healing rapid fault positioning method based on non-sound condition
CN114978794B (en) * 2022-05-19 2023-06-23 北京有竹居网络技术有限公司 Network access method, device, storage medium and electronic equipment
CN117114102A (en) * 2023-10-13 2023-11-24 江苏前景瑞信科技发展有限公司 Transformer fault diagnosis method based on Bayesian network and fault tree

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101715203A (en) * 2009-11-30 2010-05-26 中国移动通信集团浙江有限公司 Method and device for automatically positioning fault points
CN103856789A (en) * 2014-03-13 2014-06-11 赛特斯信息科技股份有限公司 System and method for achieving OTT service quality guarantee based on user behavior analysis

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Breadth-first heuristic search;Rong Zhou等;《Artificial Intelligence》;20060430;第170卷(第4-5期);全文 *
Fault diagnosis based on dial-test data in datacenter networks;QI Xiaogang等;《Journal of Systems Engineering and Electronics》;20191031;第30卷(第5期);全文 *
多值属性系统的故障诊断策略最优化方法;王伟等;《仪器仪表学报》;20080515(第05期);全文 *
王冰纯.基于数据析的网络诊断算法研究.《CNKI优秀硕士学位论文全文数据库》.2020,第3章. *
配电网计划孤岛划分方法研究;党克等;《中国电力》;20100905(第09期);全文 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant