CN108933694B

CN108933694B - Data center network fault node diagnosis method and system based on dial testing data

Info

Publication number: CN108933694B
Application number: CN201810603564.2A
Authority: CN
Inventors: 齐小刚; 王冰纯; 刘立芳; 冯海林; 胡绍林
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2018-06-09
Filing date: 2018-06-09
Publication date: 2021-11-09
Anticipated expiration: 2038-06-09
Also published as: CN108933694A

Abstract

The invention belongs to the technical field of supervision, monitoring or testing devices, and discloses a data center network fault node diagnosis method and system based on dial-up test data, wherein a dynamic breadth-first spanning tree is generated according to the existing fault detection information and is used as a detection path between nodes; analyzing the dial-up test data based on a given prior probability p to preliminarily determine the fault probability of the network member; and selecting a reasonable threshold value through analyzing a probability distribution function to identify the fault node, and classifying the suspicious node set into a fault node set and a normal node set. The HBFD algorithm has better performance in terms of both the number of detections and the diagnostic accuracy than the HFD algorithm. The fault node in the network can be accurately identified under the condition of lower detection times in network topologies with different scales. The introduction of new methods into HBFD for the purpose of diagnosing malicious nodes or other types of failures in a network is also of some research value.

Description

Data center network fault node diagnosis method and system based on dial testing data

Technical Field

The invention belongs to the technical field of supervision, monitoring or testing devices, and particularly relates to a data center network fault node diagnosis method and system based on dial-up test data.

Background

Currently, the state of the art commonly used in the industry is as follows. With the advent of the big data era, the growing demand for cloud computing has expanded the scale of data center networks. Today, data center networks contain hundreds of thousands of servers connected by network interface cards (NICs), switches and routers, cables and fibers, most of which are distributed and carry high traffic. In large systems, detecting and locating faults is important so that the network management system can restore network communication through a fault recovery mechanism. Although many studies are devoted to fault diagnosis strategies, the problems described below still need to be solved.

1) Complexity of diagnosis: in addition to the higher time and space complexity of fault localization, the increase in network size may also make fault diagnosis more complex because of dynamic, incomplete and uncertain information. It is therefore important to reduce the number of detections and improve the detection efficiency of fault diagnosis.

2) Increased network load: data centers may significantly shorten algorithm execution times, but may also increase the likelihood of controller overload. One possible solution is to apply a participation strategy to control the number of monitors; another is to increase the effectiveness of the probe data so as to reduce the amount of data.

Existing network fault diagnosis technology is mainly divided into three categories: passive fault detection, active fault detection, and fault pattern identification based on network logs.

The passive fault detection method monitors the real-time performance of the network by deploying monitoring agents in the network, and passively acquires the state information of network members. One method places passive monitoring devices on specific links in the network, where the agents monitor the current status of the network members using all links in the network at a given time; its disadvantage is that it generates redundant monitoring agents in large-scale networks. A passive fault diagnosis method using a dependency graph can detect only a limited number of fault nodes at a time and is not suitable for large-scale networks. Bayesian belief networks (BBNs) are also widely used in fault detection. A BBN models the network structure as a directed acyclic graph and then tries to find failed nodes by analyzing end-to-end observable symptoms. However, fault reasoning has high time complexity in a large-scale network, so the fault management system cannot restore network communication in a timely and effective manner.

Active fault diagnosis typically uses probes to detect the condition of servers; the selected probes are transmitted to obtain end-to-end statistics such as packet loss rate, delay and throughput. The controller then collects these statistics to obtain further information. For active detection reasoning, it is very important to design a suitable and effective detection strategy. One fault diagnosis system architecture uses adaptive probing. Most probe-based techniques contain three components: probe station selection, probe selection and fault reasoning. These approaches, however, are limited by traffic overhead in large-scale networks. A phased probing method reduces network traffic overhead by using only a small set of probes to detect a small area of the network at each stage. However, how to arrange probe stations reasonably and how to handle probe station failures remain to be discussed further. A probe station selection algorithm has been proposed to minimize the number of probe stations and make them robust against failure. However, the problem of how to place a probe station to monitor a faulty station remains unsolved.

With the development of big data technology, fault diagnosis based on log data has drawn much attention. Techniques based on network system logs typically rely on a threshold algorithm that first sets appropriate thresholds for different detection capabilities of the network based on the experience of the network administrator, and then detects faults by comparing actual values with the default thresholds. This technique is simple but has two distinct disadvantages: 1) its threshold is chosen empirically; 2) data below the threshold is not analyzed, so some detailed information about the network condition may be missed. A novel analysis system for active fault diagnosis considers not only keywords of abnormal logs, such as "error" and "decline", but also tries to find sudden fault patterns. However, data-based algorithms have high time complexity in data preprocessing (such as data extraction, data cleanup, and exception handling).

In summary, the problems of the prior art are as follows:

(1) The passive fault detection method generates redundant monitoring agents in a large-scale network, so that many useless detection packets exist in the network. When the network is large and busy, the redundant detection packets can affect the normal service of the network and even the fault diagnosis result, so passive fault diagnosis is not suitable for large-scale networks. Therefore, in a large-scale network, on the one hand, an active fault monitoring technology can be adopted to effectively reduce redundant detection packets; on the other hand, it is necessary to improve the effectiveness of the probe packets and reduce the number of probes used for fault diagnosis and detection.

(2) Active fault diagnosis is limited by traffic overhead in a large-scale network. Reasonable and effective probe stations must be placed in the network, and their positions and number directly affect the accuracy of the fault diagnosis result, but existing research has not solved the problems of probe station placement and number. In a large-scale network, designing probe paths that cover the network has high time complexity, and recalculation is needed whenever the network topology changes, so the approach is not suitable for dynamic network structures.

(3) In network-log-based techniques, the decision threshold is selected empirically. Because the technique does not analyze data below the threshold, some detailed information about the network condition may be missed. When the network state changes abruptly and unpredictably, an empirical threshold cannot accurately judge the current state of the network, so the network fault management system cannot acquire fault information in the network. Analyzing all fault data, on the other hand, has high time complexity and yields much redundant information.

The difficulty and significance of solving these technical problems are as follows. In a large-scale data center network, passive fault detection is insufficient in real-time performance and effectiveness, while active fault detection faces the problem of how to select probe stations and probe paths. For a large-scale complex network, making the transmitted probes traverse every network path once is an NP-hard problem, and recalculation is needed for each network change, which greatly limits network topology reconstruction, optimization and the like; a new, reasonable and effective fault diagnosis mode is therefore very necessary in engineering. In addition, when judging whether a network node has failed, the traditional method relying on manual experience is also severely limited, so establishing a proper model and selecting a proper threshold for different network structures is also of research significance.

Disclosure of Invention

Aiming at the problems in the prior art, the invention provides a data center network fault node diagnosis method and system based on dial testing data.

The data center network fault node diagnosis method based on the dial-up test data generates a dynamic breadth-first spanning tree as a detection path between nodes according to the existing fault detection information; analyzing the dial-up test data based on a given prior probability p to preliminarily determine the fault probability of the network member; and selecting a reasonable threshold value through analyzing a probability distribution function to identify the fault node, and classifying the suspicious node set into a fault node set and a normal node set.

Furthermore, in the data center network fault node diagnosis method based on dial testing data, given a dynamic spanning tree, the test state $s_i$ between any pair of adjacent nodes is composed of two indicator variables, $s = \{r_{ij}, r_{ji}\}$, wherein $r_{ij}$ ($r_{ji}$) is the result of node i (j) testing node j (i); $r_{ij} = 0$ means that node i identifies node j as a normal node, and $r_{ij} = 1$ means that node j is identified by node i as being in a fault state; the matrix composed of the $s_i$ is called the symptom matrix S.

Further, the data center network fault node diagnosis method based on dial testing data specifically comprises: dynamic spanning tree search, failure probability evaluation and failure reasoning;

the dynamic spanning tree search dynamically generates a detection tree according to the last detection result;

the failure probability evaluation quantifies the fault probability of each suspicious node;

the failure reasoning puts absolute fault nodes into a fault group, selects a proper threshold according to the fault probability table, and divides the suspicious nodes into a relatively faulty group and a relatively normal group.

Further, the dynamic spanning tree search is based on a heuristic breadth-first search algorithm, wherein N is the set of nodes in the network, NF is the normal node set, and F is the fault node set;

step one, F ← the last detection result;

step two, if $F = \varnothing$, go to step three; otherwise, go to step four;

step three, find the breadth-first spanning tree by the conventional algorithm;

step four, NF ← N − F, and use NF as the initial search nodes.

Further, the fault probability of each node is evaluated according to the fault probability table, and a decision function $\psi_c$ is adopted to determine the final fault probability of each node in one detection;

obtaining the unique probability of each node failure; and determining the final fault node through fault reasoning.

Further, the fault reasoning considers that the node is a fault node when the fault probability of the node is greater than 0.5.

Another object of the present invention is to provide a data center network fault node diagnosis system based on dial test data for implementing the data center network fault node diagnosis method based on dial test data, where the data center network fault node diagnosis system based on dial test data includes:

the breadth-first spanning tree module is used for generating a dynamic breadth-first spanning tree as a detection path between nodes according to the existing fault detection information;

the failure probability determination module is used for analyzing the dial-up test data based on the given prior probability p to preliminarily determine the failure probability of the network members;

and the classification module is used for selecting a reasonable threshold value through analyzing a probability distribution function to identify the fault node and classifying the suspicious node set into a fault node set and a normal node set.

The invention also aims to provide a data center network system applying the data center network fault node diagnosis method based on dial testing data.

In summary, the advantages and positive effects of the invention are as follows. By adopting a point-to-point detection technology, the problems of selecting the number of probe stations and placing them in the network are avoided, and only one reliable data processing center needs to exist in the network. For the detection path, the characteristic of large node degree in data center networks is exploited: a breadth-first search path is generated by the breadth-first algorithm to obtain the dial test data, which reduces the computational cost caused by network topology reconstruction and optimization. An effective probability distribution function is designed through probability calculation, and a reasonable decision threshold is selected on that basis, which effectively avoids the influence of human experience. Experimental results show that, compared with the HFD algorithm, the HBFD algorithm performs better in both the number of detections and the diagnostic accuracy. Fault nodes can be accurately identified with a lower number of detections in network topologies of different scales. The method is a fault diagnosis technology better suited to large-scale data center networks.

Drawings

Fig. 1 is a flowchart of a data center network fault node diagnosis method based on dial-up test data according to an embodiment of the present invention.

Fig. 2 is a flowchart of an implementation of a data center network fault node diagnosis method based on dial-up test data according to an embodiment of the present invention.

Fig. 3 is a schematic diagram of fault detection provided by the embodiment of the present invention.

Fig. 4 is a diagram illustrating the comparison result between the HBFD algorithm and the HFD algorithm according to the embodiment of the present invention.

Fig. 5 is a schematic diagram of the impact of the fault granularity provided by the embodiment of the present invention.

Fig. 6 is a schematic diagram of the network size influence provided by the embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

The rapid growth in size and structural complexity of data center networks has made network management systems increasingly challenging. Since network failures (node failures or link failures) are inevitable, finding a method for rapidly diagnosing network failures so as to be able to effectively recover network communication functions has become an important research topic in academia and industry. The method can quickly and effectively monitor the node faults of the data center network, and determine the fault nodes by 1) generating a dynamic search tree, 2) analyzing dial-up test data and 3) selecting a reasonable threshold value on the basis of a heuristic breadth-first diagnosis algorithm. Simulation results show that the HBFD algorithm can effectively diagnose node faults and effectively reduce detection times and false alarm rate under the condition of ensuring diagnosis accuracy.

As shown in fig. 1, the data center network fault node diagnosis method based on dial-up test data according to the embodiment of the present invention includes the following steps:

S101: generating a dynamic breadth-first spanning tree as a detection path between nodes according to the existing fault detection information;

S102: analyzing the dial-up test data based on a given prior probability p to preliminarily determine the fault probability of the network members;

S103: selecting a reasonable threshold value by analyzing a probability distribution function to identify the fault nodes, and classifying the suspicious node set into a fault node set and a normal node set.

The application of the principles of the present invention will now be described in further detail with reference to the accompanying drawings.

1. Algorithm overview

The invention models the real-world network as an undirected graph consisting of a set of nodes N connected by a set of links L; the data center network has the characteristic of high connectivity. Current technology is assumed to fully allow the elements in a network to communicate with each other, and the invention obtains the real-time state information of the nodes through mutual testing between adjacent nodes. The detection method first needs an algorithm for selecting a fast and effective detection path, so as to save diagnosis time and resource consumption. Therefore, in view of the high connectivity of a data center network, the invention first generates a dynamic breadth-first spanning tree as a detection path between nodes according to the existing fault detection information, then analyzes the dial test data based on a given prior probability p to preliminarily determine the fault probability of the network members, and finally selects a reasonable threshold by analyzing a probability distribution function to identify the fault nodes, classifying the suspicious node set into a fault node set and a normal node set.

HBFD mainly consists of three parts: Dynamic Spanning Tree Search (DSTS), Failure Probability Evaluation (FPE) and Failure Reasoning (FR). The main idea of each process is briefly described below.

Definition 1: given a dynamic spanning tree, the test state $s_i$ (also called the symptom) between any pair of adjacent nodes is composed of two indicator variables, $s = \{r_{ij}, r_{ji}\}$, where $r_{ij}$ ($r_{ji}$) is the result of node i (j) testing node j (i). Here, $r_{ij} = 0$ means that node i identifies node j as a normal node, and $r_{ij} = 1$ means that node j is recognized by node i as being in a failure state. The matrix composed of the $s_i$ is called the symptom matrix S.
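Definition 1 can be illustrated with a minimal Python sketch. The function name `build_symptom_matrix`, the dictionary representation of S and the `ideal_test` oracle are all assumptions made for illustration; the patent does not prescribe a data structure.

```python
from typing import Callable, Dict, Iterable, Tuple

Edge = Tuple[int, int]


def build_symptom_matrix(
    tree_edges: Iterable[Edge],
    test: Callable[[int, int], int],
) -> Dict[Edge, Tuple[int, int]]:
    """Collect the symptom (r_ij, r_ji) for every edge (i, j) of the tree.

    test(i, j) is a hypothetical callback returning 0 if node i reports
    node j as normal and 1 if node i reports node j as faulty.
    """
    symptoms = {}
    for i, j in tree_edges:
        symptoms[(i, j)] = (test(i, j), test(j, i))
    return symptoms


# Hypothetical, idealised oracle: every tester reports the truth.
faulty = {2}


def ideal_test(tester: int, target: int) -> int:
    return 1 if target in faulty else 0


S = build_symptom_matrix([(0, 1), (1, 2)], ideal_test)
```

Here each entry of `S` holds one pair of mutual test results, so the edge between nodes 1 and 2 records that node 1 flags node 2 while node 2 reports node 1 as normal.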

Dynamic Spanning Tree Search (DSTS) dynamically generates a probe tree based on the last probe result. HBFD can therefore avoid, as far as possible, using failed nodes to probe other suspicious nodes, which effectively reduces the uncertainty of the detection result (for example, when a failed node i probes other nodes, a failed node may be marked as normal in the symptom result, so that the detection result has a large false alarm rate).

Failure Probability Evaluation (FPE) is used to quantify the failure probability of each suspect node. As shown in fig. 3, the failure probability of a node n depends not only on its prior probability (the probability that a failed node diagnoses correctly) but also on the diagnosis results of other nodes. The FPE obtains the initial detected failure probability of each node from the failure probability table, and then calculates the final failure probability of the node through a decision function. The higher the final failure probability, the more likely node n is to have failed.

Finally, the failure probability of all suspicious nodes is further analyzed by a Failure Reasoning (FR) part. First, the FR puts the absolute failed node (whose failure probability equals 1) into the failed group. Then, the FR selects an appropriate threshold value according to the failure probability table, and divides the suspicious nodes into a relatively failed group (F) and a relatively normal group (NF).

In view of the applicability of the HBFD algorithm to data center networks of different architectures, the present invention considers randomly generated connected network topologies. The present invention also assumes that no malicious nodes (nodes that generate error messages with probability 1) exist in the network and that at least one trusted management controller is used to collect dial-up test data and execute the HBFD algorithm.

2. Heuristic fault diagnosis

The present invention makes no assumption about any particular network topology, although specific optimizations may exist for different network topologies. The present invention also does not take malicious nodes or link failures into account, and considers it feasible to deploy the HBFD algorithm in different data center network topologies (e.g., VL2, FiConn and DCell). Although HBFD is logically a single entity, it can still be implemented in a distributed manner by analyzing dial-up test data that is stored in a distributed way. The invention also assumes that at least one trusted management controller (AC) in the network is used to collect and analyze dial-up test data.

2.1 problem analysis

Failures are network events that are the source of network communication problems. A node failure occurs when a device is not available to route or forward traffic. Node failures may be caused by many factors: for example, a device may be powered off for repair or crash due to hardware errors, or packets may be dropped or responses may time out when network traffic is too heavy. All of these may cause uncertainty in the detection result.

Active probing sends one or more data packets to nodes in the network to detect the real-time status of each node. According to the PMC model definition, a set of test nodes will produce 6 different probing results: 1) the test results of two normal nodes are that both nodes are in a normal operating state (e.g., $r_{ij} = 0$, $r_{ji} = 0$); 2) when a normal node detects a failed node, the result must be that the node is a failed node (e.g., $r_{ij} = 0$, $r_{ji} = 1$); 3)-6) the probing result of a failed node is indeterminate regardless of the state of the node under test (e.g., $r \in \{0, 1\}$). The present invention defines the behavior of a failed node as a symptom (e.g., $r = 1$). Fig. 3 gives two simple examples to illustrate the cause of probing uncertainty caused by a failed node in the network.

Example 1: in fig. 3-a, { a, b, c } are three nodes of the network, assuming that nodes a and c are two failed nodes and node b is a normal node. Then the result of the detection of { a, b } is any one of the sets { (0,1), (1,1) }, and the result of { b, c } is one of the sets { (1,0), (1,1) }. When the symptom group is { (0,1), (1,0) }, the node b is considered to be in a good state; however, when the symptom groups are { (1,1), (1,1) }, node b is considered to be the failed node, and the remaining combinations are not sufficient to determine the condition of node b.
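The outcome sets in Example 1 can be reproduced by enumerating what each tester may report. The sketch below is illustrative only: the function `possible_symptoms` is a hypothetical helper, and it assumes that a normal tester always reports the true state while a failed tester may report either value.

```python
from itertools import product
from typing import List, Tuple


def possible_symptoms(i_faulty: bool, j_faulty: bool) -> List[Tuple[int, int]]:
    """Enumerate every (r_ij, r_ji) pair that nodes i and j may produce.

    Assumption: a normal tester reports the true state of its target,
    whereas a failed tester may report either 0 or 1.
    """
    r_ij_options = {0, 1} if i_faulty else {1 if j_faulty else 0}
    r_ji_options = {0, 1} if j_faulty else {1 if i_faulty else 0}
    return sorted(product(r_ij_options, r_ji_options))


# Example 1: nodes a and c have failed, node b is normal.
ab = possible_symptoms(i_faulty=True, j_faulty=False)   # pair {a, b}
bc = possible_symptoms(i_faulty=False, j_faulty=True)   # pair {b, c}
```

Under these assumptions `ab` and `bc` match the two sets listed in Example 1, and a pair of two failed nodes yields all four combinations, which is why a single round of probing cannot settle the state of node b.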

Example 2: in fig. 3-b, when node a fails, the states of the four nodes are indeterminate regardless of the state of nodes { b, c, d }.

Definition 2: a failed node obtains a correct detection result with a given probability, and this probability is defined as the prior probability p.

Given a prior probability p, when a failed node $n_i$ tests a normal node, it obtains the symptom $r_{ij} = 0$ with probability p and the symptom $r_{ij} = 1$ with probability 1 − p.

In the case of a given network, when the network conditions (e.g., connectivity, time delay, and packet loss rate) are relatively good, the prior probability p is high, and thus the dial-up test data will be more efficient. The detection result is completely accurate when p is 1, but it exists only in the ideal case.
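The role of the prior probability p can be made concrete with a small simulation. This is a sketch under stated assumptions: the function `probe` and its parameters are hypothetical, and it extends Definition 2 by assuming that a failed tester is also correct with probability p when its target has failed.

```python
import random


def probe(tester_faulty: bool, target_faulty: bool, p: float,
          rng: random.Random) -> int:
    """Simulate one test result r (0 = target reported normal, 1 = faulty).

    Assumption: a normal tester is always correct; a failed tester
    reports the correct result with probability p and the opposite
    result with probability 1 - p.
    """
    truth = 1 if target_faulty else 0
    if not tester_faulty:
        return truth
    return truth if rng.random() < p else 1 - truth


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [probe(True, False, 0.3, rng) for _ in range(10000)]
rate_zero = samples.count(0) / len(samples)  # empirical P(r_ij = 0)
```

With p = 0.3, the fraction of correct results `rate_zero` stays close to 0.3, so most reports from a failed tester are wrong; with p = 1 the failed tester behaves like a normal one, which corresponds to the ideal case described above.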

2.2 dynamic spanning Tree searching

In order to detect node faults, a breadth-first spanning tree is generated by combining the high connectivity of the data center network with the current detection information, and is used as the path for mutual detection between nodes. Even if the prior probability is introduced to quantify the uncertainty of the detection result, the detection result of a fault node is still uncertain, and the dial test data obtained by one detection is not enough to find all fault nodes. A point-to-point fault diagnosis algorithm called HFD discovers most fault nodes through multiple detections, but it does not make reasonable use of the state information obtained after each detection to reduce the uncertainty of each detection result; its number of detections is therefore large, and it is not suitable for large-scale data center networks.

The data center network has the characteristic of high connectivity. Therefore, for a large-scale data center network, the search of the spanning tree by using a breadth-first search (BFS) algorithm can effectively improve the speed of searching the detection path. BFS may also effectively avoid situations where a single diagnosis causes a high false alarm rate, as shown in fig. 3. Based on BFS, the invention designs a heuristic breadth first search algorithm (HBFS) as shown in algorithm 1, wherein N is a group of nodes in the network, NF is a normal node set, and F is a fault node set.

The heuristic breadth-first search algorithm dynamically uses the information of the absolute fault node (fault probability is 1) contained in the previous detection result to find a new spanning tree which avoids detecting other nodes by using the absolute fault node. By reducing the uncertainty of the detection result and the communication frequency, the HBFS algorithm effectively improves the detection accuracy and the detection speed.
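The idea behind the heuristic breadth-first search can be sketched as follows. This is not the patent's algorithm 1 verbatim: the function `hbfs_tree`, the adjacency-dictionary input and the rule that known failed nodes are attached as leaves but never expanded are assumptions chosen to illustrate "avoid detecting other nodes by using the absolute fault node".

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple


def hbfs_tree(adj: Dict[int, Iterable[int]],
              known_faulty: Set[int]) -> List[Tuple[int, int]]:
    """Return (tester, target) edges of a breadth-first spanning forest.

    Assumptions: search roots are taken from NF = N - F, and nodes in F
    (known_faulty) may be tested but are never used as testers. A failed
    node with no non-failed neighbour is left out of the forest.
    """
    visited: Set[int] = set()
    edges: List[Tuple[int, int]] = []
    roots = [n for n in adj if n not in known_faulty]  # NF <- N - F
    for root in roots:
        if root in visited:
            continue
        visited.add(root)
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v in visited:
                    continue
                visited.add(v)
                edges.append((u, v))
                if v not in known_faulty:
                    queue.append(v)  # only non-failed nodes are expanded
    return edges


# Hypothetical 4-node topology; node 1 was found faulty in the last round.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
edges = hbfs_tree(adj, known_faulty={1})
plain = hbfs_tree(adj, known_faulty=set())
```

When F is empty the function reduces to an ordinary breadth-first spanning tree; once node 1 is known to have failed, node 3 is reached through node 2 instead, so no test result depends on a report from node 1.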

2.3 Fault probability assessment

For a randomly generated network topology, a dial test result of 0 is defined as the tested node being identified as normal, and a result of 1 indicates that the tested node is identified as a fault node (a symptom). The detection results and the corresponding failure probabilities shown in table 1 are then obtained.

Table 1: fault probability table

The failure probability of each node may be obtained from the failure probability table. On the breadth-first spanning tree search path, the probe result corresponding to each node is uncertain. The invention therefore designs a decision function (DF) $\psi_c$, based on the failure probability, to determine the final failure probability of each node in a probe.

The unique probability of failure of each node can be obtained by DF. And finally, determining the final fault node through fault reasoning.

2.4 Fault reasoning

Having obtained the final failure probability for each node, the absolute failed nodes (e.g., $\psi_c(\cdot) = 1$) are first put into the failed node set (F), and then two appropriate thresholds are selected according to the failure probability table to determine the other failed nodes. For the four detection results in table 1, the following analysis is performed:

1) The result of num.1 indicates that the nodes probing each other are in the same state. When the prior probability p ≥ 0.3, the failure probability of the two nodes is lower than 0.5 and the two nodes are considered to be in a normal state. When the prior probability p < 0.3, the failure probability is higher than 0.5. This may cause the decision function to assign a higher failure probability to normal probing nodes, meaning that there are malicious nodes that provide false information to interfere with the fault diagnosis process.

2) The detection results of num.2 and num.3 indicate that there is an absolute fault node between the detection nodes (i.e. the detection fault probability is 1), and the fault probability of the other node is very low regardless of the prior probability p, so that the node with the fault probability of 1 is considered as an absolute fault node, and the other node is in a normal state.

3) The result of num.4 shows that the probability is always higher than 0.5, but it cannot be determined whether at least one node fault exists in the inter-test nodes, so that both inter-test nodes are marked as suspicious nodes to perform the next detection for further determination.

In summary, when the failure probability of a node is greater than 0.5, the node is considered a failed node, and the prior probability p is less than 0.3. When the next detection is carried out, the analysis of the last detection result is taken into account, which effectively avoids using a failed node to detect several other network members at the same time and reduces the uncertainty of the detection result, thereby improving the detection speed.
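The grouping performed by fault reasoning can be sketched in a few lines. The function `fault_reasoning`, the single threshold argument and the example probabilities are assumptions for illustration; the actual values come from the failure probability table and the decision function, which are not reproduced here.

```python
from typing import Dict, Set, Tuple


def fault_reasoning(final_prob: Dict[str, float],
                    threshold: float = 0.5) -> Tuple[Set[str], Set[str], Set[str]]:
    """Split nodes into absolute-fault, relatively faulty and relatively normal.

    Assumption: a node with final failure probability 1 is an absolute
    fault node; otherwise it is relatively faulty if its probability
    exceeds the threshold and relatively normal if it does not.
    """
    absolute = {n for n, q in final_prob.items() if q == 1.0}
    rel_faulty = {n for n, q in final_prob.items() if threshold < q < 1.0}
    rel_normal = {n for n, q in final_prob.items() if q <= threshold}
    return absolute, rel_faulty, rel_normal


# Hypothetical final failure probabilities for four nodes.
probs = {"a": 1.0, "b": 0.7, "c": 0.2, "d": 0.5}
absolute, rel_faulty, rel_normal = fault_reasoning(probs)
```

In this sketch a node exactly at the threshold is treated as relatively normal, matching the "greater than 0.5" wording; the absolute fault nodes are the ones the next round of spanning tree search would avoid using as testers.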

The application effect of the present invention will be described in detail with reference to the simulation.

1. As an active point-to-point fault diagnosis algorithm based on data center network dial test data, HBFD can be flexibly deployed and integrated with existing routing protocols to improve the accuracy of fault diagnosis and reduce the monitoring cost. The simulation results and performance of the algorithm proposed by the present invention are presented below.

The evaluation indexes include: (1) number of probes (DN); (2) fault granularity (FG); (3) correct diagnosis rate (CDR); (4) false alarm rate (FDR). The number of probes measures the number of communications used to acquire dial test data between nodes during fault detection. The lower the DN value, the better the algorithm. PF denotes the set of truly failed nodes. The fault granularity is the ratio of the number of failed nodes to the network size.

The correct diagnosis rate and the incorrect diagnosis rate are two evaluation indexes for measuring the accuracy of the fault diagnosis algorithm, and are written as follows:

2. Simulation results

Considering the applicability of the HBFD algorithm, network topologies are generated and failed nodes are selected at random. The performance of the HBFD algorithm is analyzed from three aspects, with the simulation results shown in figs. 4-6.

1) Advantages of HBFD

The advantages of the HBFD algorithm compared with the Hierarchical Fault Diagnosis (HFD) algorithm are shown in fig. 4. The network topology contains 5000 nodes, the prior probability is 0.3, and the fault granularity ranges from 0.05 to 0.5. It is evident from the figure that the HBFD algorithm markedly improves both the number of tests and the correct diagnosis rate. This also means that avoiding the use of failed nodes to detect other nodes can effectively improve the accuracy of the diagnosis and effectively reduce invalid communications. As a drawback, however, HBFD still produces a small false alarm rate, as shown in fig. 4; the cause of the false alarm rate is analyzed below.

2) Effect of Fault granularity

As shown in fig. 5, FG is set in the range [0.1, 0.4]. The accuracy of the diagnosis remains very stable, but the false alarm rate increases slightly as the fault granularity increases. This is because a higher fault granularity means more failed nodes in the network, so the heuristic breadth-first search (HBFS) may be unable to find a dynamic spanning tree without failed nodes. The fourth symptom described in table 1 then appears with a higher probability, and several normal nodes are therefore determined to be failed nodes.

3) Influence of network size

Networks of different sizes in the range [1000, 5000] nodes are considered with the same fault granularity of 0.5, as shown in fig. 6. The correct diagnosis rate is stable across networks of different sizes, and the false alarm rate decreases as the network size increases. These results indicate that HBFD is robust in networks of different sizes and has a low false alarm rate.

The point-to-point fault diagnosis algorithm based on dial-up test data detects and locates the most probable failed nodes. The HBFD algorithm performs better than the HFD algorithm in terms of both the number of detections and the diagnostic accuracy, and it accurately identifies failed nodes with fewer detections in network topologies of different scales. In subsequent work, on one hand, the root cause of the false alarm rate of the algorithm will be analyzed further, with the aim of eliminating it; on the other hand, the HBFD algorithm will be combined with different transmission protocols and with more realistic network topologies. It is also of value to introduce new methods into HBFD in order to diagnose malicious nodes or other types of failures in the network.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims

1. A data center network fault node diagnosis method based on dial-up test data is characterized in that the data center network fault node diagnosis method based on dial-up test data generates a dynamic breadth-first spanning tree as a detection path between nodes according to the existing fault detection information; analyzing the dial-up test data based on a given prior probability p to preliminarily determine the fault probability of the network member; selecting a reasonable threshold value through analyzing a probability distribution function to identify a fault node, and classifying a suspicious node set into a fault node set and a normal node set; the fault node acquires a correct detection result according to a given probability, and the probability is defined as a prior probability p;

when the fault probability of the node is greater than 0.5, the node is considered as a fault node, and the prior probability p is less than 0.3;

in the data center network fault node diagnosis method based on dial-up test data, given a dynamic spanning tree, the test state s_i between any pair of adjacent nodes is composed of two indicator variables, s_i = {r_ij, r_ji}, wherein r_ij is the result of node i testing node j; r_ji is the result of node j testing node i; r_ij = 0 means that node i identifies node j as a normal node, and r_ij = 1 means that node i identifies node j as being in a fault condition; the matrix composed of the s_i is called the symptom matrix S;

the data center network fault node diagnosis method based on dial testing data specifically comprises the following steps: dynamic spanning tree, failure probability evaluation and failure reasoning;

fault reasoning puts absolute fault nodes into a fault group; selecting a proper threshold value according to the fault probability table, and dividing suspicious nodes into a relative fault group and a relative normal group;

the dynamic spanning tree search is based on a heuristic breadth-first search algorithm, wherein N is a group of nodes in a network, NF is a normal node set, and F is a fault node set;

step one: obtaining the absolute fault nodes of the last detection result and assigning them to F, and going to step two;

step two: if F is not empty, going to step three; otherwise, going to step four;

step three: obtaining a breadth-first tree in combination with the network topology, starting one detection, determining the fault probability of the network nodes in the current detection in combination with the fault probability table and the decision function, determining the absolute fault nodes, and going to step four;

step four: if an absolute fault node exists, going to step one; if no absolute fault node exists, finishing the detection and outputting all nodes with a fault probability greater than 0.5 as fault nodes;

wherein, in the fault probability evaluation, the fault probability of each node is obtained according to the fault probability table, and a decision function ψ_c is adopted to determine the final fault probability of each node in one detection;

f(n_j) = max{e_ij | e_ij ∈ E}, n_j ∈ N

thereby obtaining a unique fault probability for each node; and determining the final fault nodes through fault reasoning.

2. A data center network fault node diagnosis system based on dial-up test data for implementing the data center network fault node diagnosis method based on dial-up test data according to claim 1, wherein the data center network fault node diagnosis system based on dial-up test data comprises:

the failure probability determination module is used for analyzing the dial-up test data based on the given prior probability p to preliminarily determine the failure probability of the network members; the fault node acquires a correct detection result according to a given probability, and the probability is defined as a prior probability p;

the classification module is used for selecting a reasonable threshold value through analyzing a probability distribution function to identify a fault node and classifying a suspicious node set into a fault node set and a normal node set; and when the failure probability of the node is more than 0.5, the node is considered as a failure node, and the prior probability p is less than 0.3.