CN113938407B - Data center network fault detection method and device based on in-band network telemetry system - Google Patents

Data center network fault detection method and device based on in-band network telemetry system

Info

Publication number
CN113938407B
CN113938407B (application CN202111027721.8A)
Authority
CN
China
Prior art keywords
network
fault
path
preset
int
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111027721.8A
Other languages
Chinese (zh)
Other versions
CN113938407A (en)
Inventor
潘恬
贾晨昊
许凯
宋恩格
张志龙
黄韬
刘韵洁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN202111027721.8A priority Critical patent/CN113938407B/en
Publication of CN113938407A publication Critical patent/CN113938407A/en
Application granted granted Critical
Publication of CN113938407B publication Critical patent/CN113938407B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 43/00: Arrangements for monitoring or testing data switching networks
    • H04L 43/08: Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L 43/0823: Errors, e.g. transmission errors
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L 41/06: Management of faults, events, alarms or notifications
    • H04L 41/0677: Localisation of faults
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L 41/12: Discovery or management of network topologies
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 43/00: Arrangements for monitoring or testing data switching networks
    • H04L 43/12: Network monitoring probes
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 45/00: Routing or path finding of packets in data switching networks
    • H04L 45/28: Routing or path finding of packets in data switching networks using route fault recovery
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 45/00: Routing or path finding of packets in data switching networks
    • H04L 45/34: Source routing
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04Q: SELECTING
    • H04Q 9/00: Arrangements in telecontrol or telemetry systems for selectively calling a substation from a main station, in which substation desired apparatus is selected for applying a control signal thereto or for obtaining measured values therefrom
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00: Reducing energy consumption in communication networks
    • Y02D 30/50: Reducing energy consumption in communication networks in wire-line communication networks, e.g. low power modes or reduced link rate

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Environmental & Geological Engineering (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The embodiment of the application provides a fault detection method and a related device for a data center network based on an in-band network telemetry system, wherein the method comprises the following steps: generating an INT probe packet based on the in-band network telemetry system, and forwarding the INT probe packet in the data center network according to a preset probe path; parsing the INT probe packet at a receiving end in the data center network, and storing the parsed information in a local database in the form of a path information table; detecting whether the network fault exists in a server based on the preset aging time in the path information table; if the network fault is detected in the server, rerouting through the source route; and collecting fault path information uploaded by a plurality of servers, and performing network fault localization in a centralized manner. The method and the device can address the problem that multiple faults occur simultaneously in the data center network and need to be rapidly detected and located.

Description

Data center network fault detection method and device based on in-band network telemetry system
Technical Field
The application relates to the field of data center network fault detection, in particular to a data center network fault detection method based on an in-band network telemetry system and a related device.
Background
Data centers play a vital role in information acquisition, dissemination, computation, storage and management today. To meet the ever increasing and changing demands of users, data centers are increasingly supported by larger-scale, denser data center networks (Data Center Network, DCN).
In the related art, gray faults are among the most complex and destructive network faults. Gray faults often occur not at a single location but at multiple locations throughout the data center network.
For the problem of rapidly detecting and locating multiple faults in a data center network, no effective solution has been proposed in the related art.
Disclosure of Invention
To address the problems in the prior art, the present application provides a fault detection method and device for a data center network based on an in-band network telemetry system, which can solve the technical problem that multiple faults occurring simultaneously in the data center network need to be rapidly detected and located.
In order to solve at least one of the above problems, the present application provides the following technical solutions:
In a first aspect, the present application provides a method for detecting a failure of a data center network based on an in-band network telemetry system, including: generating an INT probe packet based on an in-band network telemetry system, and forwarding the INT probe packet in the data center network according to a preset probe path; parsing the INT probe packet through a receiving end in the data center network, and storing it in a local database in the form of a path information table, wherein the path information table at least comprises a preset aging time of each path, and the preset aging time is used for detecting network faults; detecting whether the network fault exists in a server based on the preset aging time in the path information table; if the network fault is detected in the server, rerouting through the source route; and collecting fault path information uploaded by a plurality of servers, and performing network fault localization in a centralized manner.
Further, the rerouting through source routing further includes: adding a source routing field into a data packet of the data center network; determining the path along which the data packet is forwarded according to the source routing field; and, when the network fault exists in the server, deleting the aged path, updating the path information table, and then recovering the link.
Further, the detecting whether the network fault exists in the server based on the preset aging time in the path information table includes:
detecting whether the INT detection packet is received by the receiving end or not based on the preset aging time of the path in the path information table;
if it is detected that the INT probe packet is not received by the receiving end, deleting the path and determining that the network fault exists in the server, wherein the network fault comprises at least one of the following: link congestion that prevents the probe packet from reaching the receiving end, and a link failure that prevents the probe packet from reaching the receiving end so that the path stored on the receiving end ages.
Further, after detecting whether the network failure exists in the server based on the preset aging time in the path information table, the method further includes: obtaining a repair evaluation result according to a network fault localization strategy, wherein the network fault at least comprises a gray fault; and detecting the network fault in the data center according to a preset probability model, a preset error scale and a preset K value of the Top K algorithm; the preset probability model comprises a first probability model in which different error probabilities are set according to link conditions, and a second probability model in which the error probabilities of all links are unified; the preset error scale is expressed as an error scale ratio, wherein the error scale ratio = the number of faulty links / the total number of links in the topology; and the preset K value of the Top K algorithm is determined according to the number of fault repair rounds, the time consumption of each round and the accuracy.
Further, the message format of the INT probe packet comprises an Ethernet header and an IP header, and the information collected by the INT probe packet comprises a switch ID, an ingress port number and an egress port number.
Further, the performing network fault localization in a centralized manner by collecting fault path information uploaded by a plurality of servers includes: obtaining link information changes by monitoring a Redis database, and placing the changed link information into target sets, wherein the target sets comprise a vanished-path set and a repaired-path set; and calculating the probability of a faulty link according to the data in the vanished-path set.
Further, the method further comprises the following steps: calculating the fault path through a fault tree analysis algorithm; calculating the error position in the link through the fault tree analysis algorithm; and analyzing and locating the error position in a centralized manner through the fault tree analysis algorithm and a plurality of fault paths, with optimization performed in the fault tree analysis process according to a minimal cut set and a Top K algorithm.
In a second aspect, the present application provides a data center network fault detection device based on an in-band network telemetry system, comprising: an INT detection module, used for generating an INT probe packet based on an in-band network telemetry system and forwarding the INT probe packet in the data center network according to a preset probe path; a parsing module, used for parsing the INT probe packet through a receiving end in the data center network and storing it in a local database in the form of a path information table, wherein the path information table at least comprises a preset aging time of each path, and the preset aging time is used for detecting network faults; a network fault detection module, used for detecting whether the network fault exists in the server based on the preset aging time in the path information table; a rerouting module, used for rerouting through the source route when the network fault exists in the server; and a fault localization module, used for performing network fault localization in a centralized manner by collecting fault path information uploaded by a plurality of servers.
In a third aspect, the present application provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method for fault detection of a data center network based on an in-band network telemetry system when the program is executed.
In a fourth aspect, the present application provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method for fault detection in a data center network based on an in-band network telemetry system.
According to the technical scheme, the present application provides a fault detection method and a related device for a data center network based on an in-band network telemetry system: an INT probe packet is generated based on the in-band network telemetry system and forwarded in the data center network according to a preset probe path; the INT probe packet is parsed by a receiving end in the data center network and stored in a local database in the form of a path information table, wherein the path information table at least comprises a preset aging time of each path, and the preset aging time is used for detecting network faults; whether the network fault exists in a server is detected based on the preset aging time in the path information table; if the network fault is detected in the server, rerouting is performed through the source route; and fault path information uploaded by a plurality of servers is collected to perform network fault localization in a centralized manner. The present application realizes whole-network telemetry based on the network topology characteristics of the data center, adjusts traffic in time in response to network faults, and realizes network fault localization including gray faults, thereby solving the problem that multiple faults occurring simultaneously in the data center network need to be rapidly detected and located.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic system architecture diagram of a fault detection method of a data center network based on an in-band network telemetry system in an embodiment of the present application.
Fig. 2 is a flow chart of a fault detection method of a data center network based on an in-band network telemetry system in an embodiment of the present application.
Fig. 3 is a schematic structural diagram of a fault detection device of a data center network based on an in-band network telemetry system in an embodiment of the present application.
Fig. 4 is a schematic diagram of a Fat-tree architecture topology of a data center network fault detection method based on an in-band network telemetry system in an embodiment of the present application.
Fig. 5 is a schematic diagram of an INT probe packet format of a data center network fault detection method based on an in-band network telemetry system in an embodiment of the present application.
Fig. 6 is a schematic diagram of an INT detection flow of a fault detection method for a data center network based on an in-band network telemetry system in an embodiment of the present application.
Fig. 7 is a schematic diagram of a general packet format of a fault detection method of a data center network based on an in-band network telemetry system in an embodiment of the present application.
Fig. 8 is a schematic diagram of a normal packet forwarding flow of a data center network fault detection method based on an in-band network telemetry system in an embodiment of the present application.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
The inventors have found that in the related art, a remote centralized controller plans a plurality of communication paths according to the network topology, servers send TCP or HTTP pings, and whether the corresponding links are normal is inferred from the connection establishment time. Although this approach can detect most network faults, it is limited in that it is difficult to precisely locate the faulty device in a gray fault scenario unless traceroute technology is further used, i.e., the ping packet reports its current status to the remote centralized controller every time it passes a switching device.
In addition, in other related art, a probe packet is transmitted through an IP-in-IP tunnel by means of a server, and the probe packet returns along the original path after reaching the destination switch. After the probe packets are received by the receiving end, a learning algorithm is used to infer, from the ratio of the number of received probe packets to the number of sent probe packets, which link in the network is the most likely location of the network fault. However, because only link on-off failures are probed, some gray failures, such as loss of IP packets destined for a specific destination, cannot be detected. Meanwhile, since a machine learning algorithm is adopted to infer the fault position, zero false positives and zero false negatives cannot be theoretically guaranteed.
Considering that multiple faults may occur simultaneously in a data center network and need to be rapidly detected and located, the present application provides a fault detection method for a data center network based on an in-band network telemetry system, which can realize full-network telemetry based on the topology characteristics of the data center network, adjust traffic in time in response to network faults, and locate network faults including gray faults.
Referring to fig. 1, a schematic system architecture of a fault detection method of a data center network based on an in-band network telemetry system in an embodiment of the present application is shown, where the system specifically includes: an in-band whole-network telemetry system, a network fault detection and fast reroute system, and a network fault notification and localization system.
The server periodically transmits an INT (In-band Network Telemetry) probe packet to check whether all paths between the source and destination are feasible. The INT probe packet, after passing through a switch, records the corresponding switch ID, ingress port number and egress port number, and the switch multicasts the probe packet according to certain forwarding rules to cover all feasible paths.
Each probe packet receiving end server stores these feasible paths in a path information table, each path being accompanied by an aging time. According to the path information table, the server may send data packets along a specified feasible path through source routing (SR). When a network failure occurs, the relevant paths in the path information table will age because the probe packets cannot pass through the failed link, and these paths will no longer be selected for sending traffic. In addition, the aged paths may be reported to the remote centralized controller as path failure information.
Because path failure information on a single server is not sufficient for accurate network failure localization, the controller needs to receive path failure information from all affected servers and perform centralized analysis. More specifically, the controller needs to find the commonality between these aged paths until the failure converges to a link between the two devices.
The application provides an embodiment of a fault detection method of a data center network based on an in-band network telemetry system, referring to fig. 2, the fault detection method of the data center network based on the in-band network telemetry system specifically comprises the following contents:
step S201, an INT detection packet is generated based on an in-band network telemetry system, and the INT detection packet is forwarded in the data center network according to a preset detection path;
step S202, analyzing the INT detection packet through a receiving end in the data center network, and storing the INT detection packet in a local database in the form of a path information table, wherein the path information table at least comprises preset ageing time of each path, and the preset ageing time is used for detecting network faults;
step S203, based on the preset aging time in the path information table, detecting whether the network fault exists in a server;
Step S204, if the network fault is detected in the server, rerouting is performed through the source route;
step S205, by collecting fault path information uploaded by a plurality of servers, network fault positioning is performed in a centralized manner.
As can be seen from the above description, an INT probe packet is generated based on the in-band network telemetry system and forwarded in the data center network according to a preset probe path; the INT probe packet is parsed by a receiving end in the data center network and stored in a local database in the form of a path information table, wherein the path information table at least comprises a preset aging time of each path, and the preset aging time is used for detecting network faults; whether the network fault exists in a server is detected based on the preset aging time in the path information table; if the network fault is detected in the server, rerouting is performed through the source route; and fault path information uploaded by a plurality of servers is collected to perform network fault localization in a centralized manner. The present application realizes whole-network telemetry based on the network topology characteristics of the data center, adjusts traffic in time in response to network faults, and realizes network fault localization including gray faults, thereby solving the problem that multiple faults occurring simultaneously in the data center network need to be rapidly detected and located.
In-band network telemetry (In-band Network Telemetry, INT for short) In step S201 is a fine-grained network measurement architecture that is primarily used to collect real-time network link status In the data plane without excessive participation by the control plane. In the INT model, a transmitting end transmits special INT probe packets for full network traversal, and when the INT probe packets pass through devices with INT information acquisition functions along a probe path, the devices insert corresponding INT information into the probe packets. The INT detection packet is finally collected and processed by a remote centralized controller for realizing whole network congestion control, load balancing, network fault detection and the like.
Because INT has the function of detecting the network link state, INT can be used for detecting the network link on-off condition in real time, and gray fault discovery and positioning can be performed based on the INT.
In order for an INT probe packet to cover the whole network, specific forwarding rules need to be set within the switch according to the specific network topology. Taking a full-network detection scheme for the data center network as an example, such as the three-layer Fat-tree architecture data center network shown in fig. 4, the forwarding rules set in the switches are as follows:
For a ToR switch, when it receives a probe packet from a server, the probe packet will be forwarded to all Leaf switches connected to it (e.g., fig. 4, Server1→T1→L1 and L2). When it receives a probe packet from a Leaf switch, the probe packet will be forwarded to all servers connected to it (e.g., fig. 4, L3→T3→Server5 and Server6).
For a Leaf switch, when it receives a probe packet from a ToR switch, the probe packet will be forwarded to all ports except the ingress port (e.g., fig. 4, T1→L1→S1, S2 and T2). When it receives a probe packet from a Spine switch, the probe packet will be forwarded to all ToR switches connected to it (e.g., fig. 4, S1→L3→T3 and T4).
For a Spine switch, when it receives a probe packet from a Leaf switch, the probe packet will be sent to all ports except the ingress port (e.g., fig. 4, L1→S1→L3).
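The following is a minimal illustrative sketch of this multicast decision (the switch roles, the port map and the function name probe_egress_ports are assumptions made for this example, not the patented data-plane implementation):

```python
# Illustrative sketch only: port layout and helper names are assumptions.

def probe_egress_ports(role, ingress_port, ports):
    """Return the ports an INT probe packet is replicated to.

    role         -- "ToR", "Leaf" or "Spine"
    ingress_port -- port on which the probe arrived
    ports        -- dict: port number -> neighbor type
                    ("server", "ToR", "Leaf" or "Spine")
    """
    src = ports[ingress_port]
    if role == "ToR":
        # From a server: up to every Leaf; from a Leaf: down to every server.
        target = "Leaf" if src == "server" else "server"
        return [p for p, n in ports.items() if n == target]
    if role == "Leaf":
        if src == "ToR":
            # From a ToR: to all ports except the ingress port.
            return [p for p in ports if p != ingress_port]
        # From a Spine: down to every ToR.
        return [p for p, n in ports.items() if n == "ToR"]
    if role == "Spine":
        # From a Leaf: to all ports except the ingress port.
        return [p for p in ports if p != ingress_port]
    return []

# Example: a Leaf switch with Spine uplinks on ports 1-2 and ToR downlinks on 3-4.
leaf_ports = {1: "Spine", 2: "Spine", 3: "ToR", 4: "ToR"}
print(probe_egress_ports("Leaf", 3, leaf_ports))   # probe from a ToR -> [1, 2, 4]
```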
As a preferred embodiment of the present application, the following message format is adopted for the INT probe packet, where the message format includes an ethernet header and an IP header, and the information format of the collected INT probe packet includes a switch ID, an ingress port number and an egress port number.
An INT probe packet is generated using a server. In the Fat-tree architecture data center network, servers connected under one ToR switch share the same path information, so that each server does not need to participate in sending an INT detection packet, and only one server needs to be designated as an INT detection packet sending end under each ToR switch.
Further, probe packet processing is performed on the switch side. In a standard INT framework, an INT-enabled device may provide a variety of link state information, including switch IDs, ingress and egress port numbers, queue depth, forwarding delays, etc. The invention is mainly used for detecting and locating link faults, so the INT probe packet is simplified so that it occupies as little network bandwidth as possible while still providing the required function. The hardware information that the simplified INT probe packet needs to collect is:
switch_id (8-bit): switch IDs, assigned by the remote centralized controller, are unique to each switch.
Ingress_port (8-bit): an ingress port number through which probe packets enter the switch.
The egress_port (8-bit) is the port number through which the probe packet exits the switch.
In implementation, as shown in fig. 5, the initial probe packet generated by the INT probe packet sender server consists of an Ethernet header and an IP header. The protocol field in the IP header is 0x700, indicating that the data packet is a simplified INT probe packet. As the probe packet traverses the network as shown in fig. 6, each switch it passes appends the required INT information after the IP header of the INT probe packet, i.e., the INT information in the probe packet is organized as a stack.
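As an illustration of this stacked format, the sketch below parses the 3-byte hop records (switch ID, ingress port, egress port, each 8 bits) that follow the IP header of a simplified probe packet; the byte layout and function name are assumptions made for this example:

```python
import struct

def parse_int_stack(int_payload: bytes):
    """Parse stacked 3-byte INT records: (switch_id, ingress_port, egress_port).

    int_payload is assumed to be the bytes appended after the IP header of a
    simplified INT probe packet (protocol value 0x700 in this description).
    """
    hops = []
    for offset in range(0, len(int_payload) - len(int_payload) % 3, 3):
        switch_id, in_port, out_port = struct.unpack_from("BBB", int_payload, offset)
        hops.append((switch_id, in_port, out_port))
    return hops

# Three hops recorded along the probe path; the hop tuple can serve as the
# key of an entry in the receiving server's path information table.
records = bytes([1, 2, 3,   5, 1, 4,   9, 2, 1])
print(tuple(parse_int_stack(records)))   # ((1, 2, 3), (5, 1, 4), (9, 2, 1))
```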
In the step S202, the INT probe packet is parsed by the receiving end in the data center network, and stored in the local database in the form of a path information table. That is, after the INT detection packet receiving end receives the detection packet, the INT information carried in the detection packet is parsed and stored in the local database in the form of a path information table.
It should be noted that the path information table at least includes a preset aging time of each path, where the preset aging time is used to detect a network failure. For example, the path information table stored in server a contains all possible paths from server a to all other accessible servers. Each path is represented by a switch ID along the way and a corresponding combination of ingress and egress port numbers.
In the step S203, it is detected whether the network failure exists in the server based on the preset aging time in the path information table. I.e. network failures including grey failures are found by path aging.
In practice, each path in the path information table has an aging time; that is, if no new probe packet arrives within a certain time to refresh the state of a path, that path is deleted. Therefore, if there is a link failure in the network, the relevant probe packets cannot reach the receiving end, the relevant paths stored on the receiving end age, and the server determines that a failure exists somewhere on the affected paths. It should be noted that when a link is congested, a probe packet may also fail to reach the receiving end within a short time, so the server may mistakenly conclude that the link has failed. Therefore, in actual operation, the aging time of the paths in the path information table can be appropriately increased to filter out transient link congestion. Very severe congestion may not be filtered out even with an increased aging time, but in that case it is reasonable to treat the link as failed and handle it accordingly.
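A minimal sketch of this aging logic follows; the table layout, the 3-second aging time and the function names are assumptions chosen only for illustration:

```python
import time

AGING_TIME = 3.0   # seconds; set generously so that brief congestion is filtered out

path_table = {}    # path tuple -> timestamp of the last probe that refreshed it

def refresh_path(path):
    """Called whenever an INT probe packet arrives over this path."""
    path_table[path] = time.time()

def sweep_aged_paths():
    """Delete aged paths and return them as suspected faulty paths."""
    now = time.time()
    aged = [p for p, t in path_table.items() if now - t > AGING_TIME]
    for p in aged:
        del path_table[p]     # the path is no longer selected for sending traffic
    return aged               # reported to the remote centralized controller
```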
If the network failure is detected in the server in step S204 described above, rerouting is performed through the source route. Fast reroute may be performed through source routing.
In implementation, as shown in fig. 7, the IP header protocol field of a normal stream packet is 0x701, and SR information indicating the next hop, such as a port number, is stacked after the IP header. As shown in fig. 8, when the data packet passes through a switch, the switch forwards the data packet to the designated egress port according to the first (i.e., leftmost) parsed SR entry and deletes that entry at the same time, so that subsequent switches do not reuse SR information that has already been consumed. When a link failure is detected, the path information table in the server is updated and the affected paths are deleted by aging, so that no subsequent data packet passes through the failed link; fast rerouting of subsequent traffic is thus achieved within one aging period.
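The following sketch models the switch behaviour just described, namely forwarding on the leftmost SR entry and popping it; the one-byte-per-hop SR layout and the fixed 20-byte IP header are simplifying assumptions:

```python
def switch_forward(packet: bytes, ip_header_len: int = 20):
    """Pop the leftmost SR entry and return (egress_port, remaining packet).

    Assumes the SR entries are single bytes (egress port numbers) stacked
    directly after the IP header, which is a simplification for illustration.
    """
    sr_offset = ip_header_len
    egress_port = packet[sr_offset]                        # leftmost SR entry
    rest = packet[:sr_offset] + packet[sr_offset + 1:]     # delete the used entry
    return egress_port, rest

pkt = bytes(20) + bytes([3, 1, 4]) + b"payload"            # dummy IP header + 3 SR hops
port, pkt = switch_forward(pkt)
print(port)   # 3: forwarded out port 3; two SR entries remain for later switches
```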
In step S205 described above, although fault detection and fast reroute have been implemented in the data plane, fault localization is still required to repair the fault. Because there is no shared full network topology between distributed servers, the failure path taken by each server alone is not sufficient to infer the specific failure location. Therefore, in the embodiment of the present application, a remote centralized controller is introduced based on the SDN design specification to collect the fault path information uploaded by each server, and network fault location is centrally performed through the fault path information uploaded by a plurality of servers.
Based on the above steps, whole-network telemetry based on the topological characteristics of the data center network is realized. Considering that a gray fault may occur on any port of any device in the DCN, and possibly at more than one location, whole-network telemetry needs to cover every port of every device. In a data center network, however, simple broadcasting may introduce loops. By studying the structure of the data center network, the present application adopts a topology-based multicast telemetry scheme for the data center network that achieves whole-network coverage without loops.
Based on the above steps, the method and the device adjust traffic in time in response to network faults. Gray faults cause a large amount of silent packet loss, and even if a network operator or an end user notices the abnormal traffic behavior, a long time is usually needed to locate the gray fault before traffic can bypass the faulty area. The longer it takes to locate the fault, the greater the traffic loss. In addition, traffic loss further triggers packet retransmissions on the end hosts, resulting in network congestion. The present application therefore proposes a strategy in which traffic is adjusted immediately after the data plane perceives a fault, without first completing fault localization.
Based on the above steps, network fault localization including gray faults is achieved. Through whole-network telemetry and local rerouting decisions, the data plane can directly adjust traffic to avoid the faulty area; in addition, network operators also need to know the specific location of a fault in order to repair it thoroughly. The present application therefore provides a network fault localization strategy, covering gray faults, based on the whole-network telemetry results, and gives repair suggestions. Preferably, the present application is not limited to the case of a fault at a single location; faults at multiple locations in a data center network can still be detected and located by the present invention.
As a preference in this embodiment, the rerouting by source routing further includes: adding a source route field into a data packet of the data center network; determining a path for forwarding the data packet according to the source routing field; and when the network fault exists in the server, deleting the aged path, updating the path information table, and then recovering the link.
In particular, in the system, an SR field is added inside a normal data packet. SR stands for source routing, a mechanism in which hop-by-hop routing information is added to a data packet at the sending end in order to specify an explicit forwarding path; that is, the switch forwards normal data packets entirely according to the SR information it reads, without relying on the traditional routing-table lookup.
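On the sending side, path selection could look like the sketch below: a live path to the destination is taken from the path information table and its egress ports become the SR field. The table format reuses the hop tuples from the parsing sketch above, and the helper names and the way a destination is identified are assumptions:

```python
def reaches(path, dst_tor_id):
    """Assumed helper: a path reaches the destination if its last hop is the
    destination's ToR switch (identified here only by switch ID)."""
    return bool(path) and path[-1][0] == dst_tor_id

def build_sr_field(dst_tor_id, path_table):
    """Return the SR field (one egress-port byte per hop) for a live path."""
    for path in path_table:                     # aged paths were already deleted
        if reaches(path, dst_tor_id):
            return bytes(out_port for _sw, _in, out_port in path)
    raise RuntimeError("no feasible path left to the destination")

# Any path that survives aging automatically avoids the faulty link, so simply
# reselecting from the table implements the fast reroute described above.
```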
As a preferred embodiment, the detecting whether the network failure exists in the server based on the preset aging time in the path information table includes: detecting, based on the preset aging time of the path in the path information table, whether the INT probe packet is received by the receiving end; and if it is detected that the INT probe packet is not received by the receiving end, deleting the path and determining that the network fault exists in the server, wherein the network fault comprises at least one of the following: link congestion that prevents the probe packet from reaching the receiving end, and a link failure that prevents the probe packet from reaching the receiving end so that the path stored on the receiving end ages.
In practice, each path in the path information table has an aging time; that is, if no new probe packet arrives within a certain time to refresh the state of a path, that path is deleted. Therefore, if there is a link failure in the network, the relevant probe packets cannot reach the receiving end, the relevant paths stored on the receiving end age, and the server determines that a failure exists somewhere on the affected paths. It should be noted that when a link is congested, a probe packet may also fail to reach the receiving end within a short time, so the server may mistakenly conclude that the link has failed. Therefore, in actual operation, the aging time of the paths in the path information table can be appropriately increased to filter out transient link congestion. Very severe congestion may not be filtered out even with an increased aging time, but in that case it is reasonable to treat the link as failed and handle it accordingly.
As a preferable aspect of the present embodiment, after detecting whether the network failure exists in the server based on the preset aging time in the path information table, the method further includes: obtaining a repair evaluation result according to a network fault localization strategy, wherein the network fault at least comprises a gray fault; and detecting the network fault in the data center according to a preset probability model, a preset error scale and a preset K value of the Top K algorithm; the preset probability model comprises a first probability model in which different error probabilities are set according to link conditions, and a second probability model in which the error probabilities of all links are unified; the preset error scale is expressed as an error scale ratio, wherein the error scale ratio = the number of faulty links / the total number of links in the topology; and the preset K value of the Top K algorithm is determined according to the number of fault repair rounds, the time consumption of each round and the accuracy.
In practice, network failures including gray failures need to be located. Three groups of variables need to be set: different probability models, different error scales, and different K values (for the Top K algorithm).
(1) Probability model
Two probability models are set according to the link conditions. In the first, different error probabilities are set according to the conditions of the links. Investigation shows that in a data center, because links use different materials, carry traffic of different scales, and so on, some links fail more easily than others and age with higher probability. Therefore, when computing link errors, the probability assigned to each link is different. Setting differentiated probabilities according to link characteristics brings the system closer to an actual data center network structure. Because simulation software is used here, a different probability is randomly generated for each link; in an actual data center network, a model of link errors can be built from a large amount of data and the error probability of each link can be obtained statistically, which yields a more accurate result. In the second model, the error probabilities of all links are unified. Unlike the first case, the differences in error probability between links are ignored, and the error probability of every link is set to P; that is, the probability of error or aging is equal for every link. This is an idealized setting, but it allows the functions implemented by the system to be verified better from a quantitative point of view. In both cases, the number of fault repair rounds, the time consumed per round and the accuracy are obtained. By comparing these indicators under the two probability models, the influence of different probability models on data center network fault detection is studied.
(2) Error scale
In order to determine the performance of the fault detection method under different error scales, a very important parameter is set: the error scale ratio (number of faulty links / total number of links in the topology). The number of fault repair rounds, the time consumption and the accuracy of the system are tested under different link error scales, i.e., when the proportion of faulty links differs. In the test, the proportion of faulty links is increased in steps of 10%, from 10% up to 100%. By controlling the error scale of the links, the performance of the system in data center networks with different error scales is further studied qualitatively.
(3) Different K values
The Top K algorithm is an extremely important algorithm in the system: it not only affects the efficiency of the system, but different K values can also greatly affect its accuracy. As noted above, the choice of K strongly influences both accuracy and time consumption. The number of fault repair rounds, the time consumed per round and the accuracy of the system are therefore observed under different K values, in order to determine how large a K value is suitable for application to the data center network.
Qualitative and quantitative analysis of the test results shows that the higher the K value, the higher the repair accuracy, but the time spent per round also increases as K increases. That is, if a more accurate inspection is desired, efficiency must be sacrificed, i.e., more time must be spent on computation. Similarly, the larger the error scale of the data center network, the higher the detection accuracy, but this also leads to a longer detection time. Therefore, according to the characteristics of different data center networks, the K value needs to be adjusted continuously to obtain an optimal trade-off.
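A sketch of how these three groups of evaluation variables might be set up in simulation is given below; the probability ranges, the number of links and the helper names are assumptions and not values fixed by this description:

```python
import random

def link_failure_probabilities(links, model="per-link", p=0.1):
    """Model 1: a different random error probability per link.
    Model 2: one unified probability P for every link."""
    if model == "per-link":
        return {link: random.uniform(0.01, 0.2) for link in links}
    return {link: p for link in links}

def inject_failures(links, error_scale):
    """error scale = number of faulty links / total number of links."""
    return set(random.sample(list(links), round(error_scale * len(links))))

links = [f"L{i}" for i in range(1, 21)]
probs = link_failure_probabilities(links, model="uniform", p=0.1)
for scale in [s / 10 for s in range(1, 11)]:        # 10%, 20%, ..., 100%
    failed = inject_failures(links, scale)
    for k in (1, 3, 5):
        # run fault localization with Top-K = k and record the number of
        # repair rounds, the time per round and the accuracy (omitted here)
        pass
```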
As a preferable mode in this embodiment, the performing network fault localization in a centralized manner by collecting fault path information uploaded by a plurality of servers includes: obtaining link information changes by monitoring a Redis database, and placing the changed link information into target sets, wherein the target sets comprise a vanished-path set and a repaired-path set; and calculating the probability of a faulty link according to the data in the vanished-path set.
In the specific implementation, changes in link information are obtained by monitoring the Redis database. When a change is detected, the changed information is put into the corresponding set. Considering other causes, such as packet loss due to network congestion or a poor link state, a link is marked as aged only after it has disappeared for a certain time. Two sets are constructed, representing the set of vanished paths and the set of repaired paths, respectively. The probability of a faulty link is calculated from the data in the vanished-path set, and a repair suggestion is given accordingly. When a link is repaired (not necessarily completely), some paths are added back to the database; the added paths are put into the repaired-path set and removed from the vanished-path set. The faulty-link probabilities are then recalculated on the updated set, repair suggestions are given, and repair is carried out. This process is repeated until the vanished-path set is empty, indicating that all paths have been repaired, i.e., the network has returned to normal.
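A polling-based sketch of this controller-side monitoring is shown below, using the redis-py client; the key naming scheme ("path:*"), the one-second polling interval and the placeholder suggest_repairs are assumptions, and Redis keyspace notifications could be used instead of polling:

```python
import time
import redis

def suggest_repairs(vanished_paths):
    """Placeholder: rank candidate faulty links from the vanished paths
    (see the fault tree sketch further below) and print repair suggestions."""
    print("vanished paths:", sorted(vanished_paths))

r = redis.Redis()     # servers are assumed to store each live path under a "path:<id>" key
vanished, repaired = set(), set()
known = {k.decode() for k in r.scan_iter(match="path:*")}

while True:
    time.sleep(1)
    current = {k.decode() for k in r.scan_iter(match="path:*")}
    newly_gone = known - current          # aged-out paths -> vanished-path set
    newly_back = vanished & current       # reappeared paths -> repaired-path set
    vanished = (vanished | newly_gone) - newly_back
    repaired |= newly_back
    known = current
    if newly_gone or newly_back:
        suggest_repairs(vanished)         # recompute faulty-link probabilities
    if repaired and not vanished:
        break                             # every vanished path is back: network is normal
```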
As a preferable example in this embodiment, the method further comprises: calculating the fault path through a fault tree analysis algorithm; calculating the error position in the link through the fault tree analysis algorithm; and analyzing and locating the error position in a centralized manner through the fault tree analysis algorithm and a plurality of fault paths, with optimization performed in the fault tree analysis process according to a minimal cut set and a Top K algorithm.
In specific implementation, the fault path is calculated through a fault tree analysis algorithm. Fault tree analysis (Fault Tree Analysis, FTA) is a top-down analysis method that uses Boolean logic to express the relationship between a top event and the basic events that cause it, and thereby analyzes how the top event arises from the basic events.
For example, suppose the fault state A of the network is determined jointly by paths P1 and P2, where path P1 fails if link L1, L2 or L3 fails, i.e., P1 = L1 or L2 or L3 = L1+L2+L3. Similarly, P2 = L1 or L2 or L4 = L1+L2+L4. State A occurs only when P1 and P2 occur simultaneously, i.e., A = P1 and P2 = P1P2.
In summary:
A = P1P2
= (L1+L2+L3)(L1+L2+L4)
= L1L1 + L1L2 + L1L4 + L2L1 + L2L2 + L2L4 + L3L1 + L3L2 + L3L4
= L1 + L1L2 + L1L3 + L1L4 + L2 + L2L3 + L2L4 + L3L4
The fault combinations leading to the top-event state A can thus be obtained by logically ANDing the successively received fault paths and expanding the result into a sum of products. The terms can then be ordered according to the fault probability of each link, providing a reference for operation and maintenance personnel.
Logically ANDing two paths and expanding them is like expanding an overly long polynomial multiplication. For example, (L1+L2)(L1+L3) requires 4 operations to obtain L1 + L1L2 + L1L3 + L2L3. If another path is then received and included in the calculation, 8 operations are needed. Therefore, assuming that each path has k hops and n fault paths are successively ANDed and expanded, the result of the first n-1 paths has k^(n-1) terms and the n-th path has k terms, so the complexity of the logical AND is O(k^n).
Optimization of the fault tree analysis algorithm. First, the cut sets are reduced to minimal cut sets, based on the following deduction: suppose the condition for event A to occur is A = L1 + L1L2 + L1L3 + L1L4 + L2 + L2L3 + L2L4 + L3L4, so that 8 cut sets can be established for A; however, since L1 and L1L2 have a containment relationship, event A occurs regardless of whether L1L2 is satisfied. According to the absorption law in Boolean algebra, the cut sets are reduced to the minimal cut sets A = L1 + L2 + L3L4. This approach discards some fault combinations and thus reduces the precision of the calculation, but it can greatly improve the efficiency of the algorithm. Second, the terms with lower probability in the minimal cut sets are removed. Assuming that each link fails independently with probability 0.1, there is a significant gap between the terms (the probability that L1 fails is 0.1, the probability that L2 fails is 0.1, while the probability that both L3 and L4 fail is 0.1×0.1=0.01). Therefore, the K fault combinations with the highest probability can be recommended each time, improving the work efficiency of operation and maintenance personnel. This is implemented mainly by traversing all fault combinations, selecting the top K and recommending them to the operators, so a Top K algorithm is adopted.
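As an illustrative sketch only (the set-based representation, function names and probability values are assumptions), the expansion, absorption and Top-K selection described above could be implemented as follows; the example reuses the two fault paths from the derivation of A:

```python
from itertools import product

def fault_combinations(paths):
    """AND together the received fault paths (each path is an OR of its links).
    Each returned cut set is a frozenset of links whose joint failure explains A."""
    cuts = {frozenset()}
    for path in paths:
        cuts = {cut | {link} for cut, link in product(cuts, path)}
    return cuts

def minimal_cut_sets(cuts):
    """Absorption law: discard any cut set that contains a smaller cut set."""
    return {c for c in cuts if not any(other < c for other in cuts)}

def top_k(cuts, link_prob, k):
    """Keep only the K cut sets with the highest joint failure probability."""
    def probability(cut):
        p = 1.0
        for link in cut:
            p *= link_prob[link]
        return p
    return sorted(cuts, key=probability, reverse=True)[:k]

paths = [["L1", "L2", "L3"], ["L1", "L2", "L4"]]        # the two aged paths P1 and P2
cuts = minimal_cut_sets(fault_combinations(paths))
print(cuts)                                              # {L1}, {L2} and {L3, L4}
link_prob = {"L1": 0.1, "L2": 0.1, "L3": 0.1, "L4": 0.1}
print(top_k(cuts, link_prob, k=2))                       # the two most likely: {L1}, {L2}
```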
Based on the above steps, the link error probability is calculated from the vanished paths in the database. Since each link is used by multiple paths in the data center, multiple paths are affected when a link fails. After the remote centralized controller collects the vanished paths, it calculates the failure probability of each candidate faulty link, ranks them, and gives repair suggestions for the paths. The error-path probability is obtained mainly by logical operations on the links in the vanished-path set.
In addition, the error position in the link is calculated through the fault tree analysis algorithm, and the specific error position is located through the fault tree analysis algorithm according to a plurality of fault paths. Meanwhile, optimization is performed according to the minimal cut set and the Top K algorithm, thereby reducing the computation cost and time.
There is further provided, according to an embodiment of the present application, a fault detection apparatus for a data center network based on an in-band network telemetry system for implementing the above method, as shown in fig. 3, the apparatus including:
an INT detection module 301, configured to generate an INT detection packet based on an in-band network telemetry system, and forward the INT detection packet in the data center network according to a preset detection path;
The parsing module 302 is configured to parse the INT probe packet through a receiving end in the data center network, and store it in a local database in the form of a path information table, where the path information table at least includes a preset aging time of each path, and the preset aging time is used to detect a network failure;
a network fault detection module 303, configured to detect whether the network fault exists in the server based on the preset aging time in the path information table;
a rerouting module 304, configured to reroute through the source route when the network failure is detected in the server;
and the fault locating module 305 is configured to centrally locate network faults by collecting fault path information uploaded by a plurality of servers.
INT is a fine-grained network measurement architecture in the INT detection module 301 that is primarily used to collect real-time network link states in the data plane without excessive participation by the control plane. In the INT model, a transmitting end transmits special INT probe packets for full network traversal, and when the INT probe packets pass through devices with INT information acquisition functions along a probe path, the devices insert corresponding INT information into the probe packets. The INT detection packet is finally collected and processed by a remote centralized controller for realizing whole network congestion control, load balancing, network fault detection and the like.
The parsing module 302 parses the INT probe packet through a receiving end in the data center network, and stores it in a local database in the form of a path information table. That is, after the INT probe packet receiving end receives the probe packet, the INT information carried in the probe packet is parsed and stored in the local database in the form of a path information table.
It should be noted that the path information table at least includes a preset aging time of each path, where the preset aging time is used to detect a network failure. For example, the path information table stored in server a contains all possible paths from server a to all other accessible servers. Each path is represented by a switch ID along the way and a corresponding combination of ingress and egress port numbers.
In the network failure detection module 303, based on the preset aging time in the path information table, it is detected whether the network failure exists in the server. I.e. network failures including grey failures are found by path aging.
In practice, each path in the path information table has an aging time; that is, if no new probe packet arrives within a certain time to refresh the state of a path, that path is deleted. Therefore, if there is a link failure in the network, the relevant probe packets cannot reach the receiving end, the relevant paths stored on the receiving end age, and the server determines that a failure exists somewhere on the affected paths. It should be noted that when a link is congested, a probe packet may also fail to reach the receiving end within a short time, so the server may mistakenly conclude that the link has failed. Therefore, in actual operation, the aging time of the paths in the path information table can be appropriately increased to filter out transient link congestion. Very severe congestion may not be filtered out even with an increased aging time, but in that case it is reasonable to treat the link as failed and handle it accordingly.
If the network failure is detected in the server in the rerouting module 304, rerouting is performed through source routing. Fast reroute may be performed through source routing.
Although fault detection and fast reroute have been implemented in the data plane, fault localization is still required in the fault localization module 305 to repair the fault. Because there is no shared full network topology between distributed servers, the failure paths observed by each server alone are not sufficient to infer the specific failure location. In the embodiment of the present application, based on the SDN design specification in the prior art, a remote centralized controller is introduced to collect the fault path information uploaded by each server, and network fault localization is performed centrally using the fault path information uploaded by the plurality of servers.
In order to further explain the scheme, the application also provides a specific application example of the fault detection method of the data center network based on the in-band network telemetry system, which specifically comprises the following contents:
the system implementing the fault detection method for a data center network based on an in-band network telemetry system comprises: an in-band whole-network telemetry system, a network fault detection and fast reroute system, and a network fault notification and localization system.
The server periodically transmits an INT probe packet to check whether all paths between the source end and the destination end are feasible; the INT probe packet records the corresponding switch ID, ingress port number and egress port number after passing through each switch, and the switch multicasts the probe packet according to certain forwarding rules to cover all feasible paths.
Each probe packet receiving end server stores these feasible paths in a path information table, each path being accompanied by an aging time. According to the path information table, the server may send data packets along a specified feasible path through source routing (SR). When a network failure occurs, the relevant paths in the path information table will age because the probe packets cannot pass through the failed link, and these paths will no longer be selected for sending traffic. In addition, the aged paths may be reported to the remote centralized controller as path failure information.
Because path failure information on a single server is not sufficient for accurate network failure localization, the controller needs to receive path failure information from all affected servers and perform centralized analysis. More specifically, the controller needs to find the commonality between these aged paths until the failure converges to a link between the two devices.
(a) In-band full-network telemetry system
1.1 planning a probe path covering the whole network. In order for an INT probe packet to cover the whole network, specific forwarding rules need to be set within the switch according to a specific network topology.
1.2 INT probe packets are generated using a server. In the Fat-tree architecture data center network, servers connected under one ToR switch share the same path information, so that each server does not need to participate in sending an INT detection packet, and only one server needs to be designated as an INT detection packet sending end under each ToR switch.
The INT specification defines the hardware information that an INT-capable device can provide, but for the fault detection problem, the probe packet only needs to collect the switch ID, the ingress port number and the egress port number. Therefore, the message format of the fault-detection-oriented INT probe packet consists of an Ethernet header and an IP header (with the protocol field set to 0x700), and the collected INT information consists of the switch ID (8-bit), ingress port number (8-bit) and egress port number (8-bit).
1.3 Probe packet processing on the switch side. In a standard INT framework, an INT-enabled device may provide a variety of link state information, including switch IDs, ingress and egress port numbers, queue depth, forwarding delays, etc. The invention is mainly used for detecting and locating link faults, so the INT probe packet is simplified so that it occupies as little network bandwidth as possible while still providing the required function.
(b) Network fault detection and fast reroute system
2.1 Path information table. After the INT probe packet receiving end receives a probe packet, the INT information carried in the packet is parsed and stored in a local database in the form of a path information table.
Network failures, including gray failures, are discovered through path aging. Each path in the path information table has an aging time; if no new probe packet arrives within that time to refresh the state of a path, the path is deleted. Therefore, if there is a link failure in the network, the relevant probe packets cannot reach the receiving end, the related paths stored on the receiving end age out, and the server determines that some of the affected paths have failed. It should be noted that when a link is congested, a probe packet may also fail to reach the receiving end in time, so the server may mistakenly judge that the link has failed. Therefore, in actual operation, the aging time of the paths in the path information table can be appropriately increased in order to filter out transient link congestion. Very severe congestion may not be filtered out even with an increased aging time, but in that case it is reasonable to treat the link as failed and handle it accordingly.
The on-off state of the feasible paths in the network is periodically detected with INT probe packets, and a path information table is maintained in the server to store the feasible paths. Each piece of path information has an aging time, and if no new INT probe packet refreshing the related path information is received within that period, the path information is deleted. Since a network failure is the main reason why an INT probe packet fails to arrive at the receiving end on time, the aging of path information can serve as the basis for judging whether the network has a failure.
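A minimal sketch of such a receiver-side path information table with per-path aging is given below; the class interface, the 3-second aging period and the use of time.monotonic() are illustrative assumptions rather than values required by this application.

```python
# Minimal sketch of a receiver-side path information table with aging.
import time

class PathTable:
    def __init__(self, aging_seconds: float = 3.0):
        self.aging_seconds = aging_seconds
        self._last_seen = {}          # path (tuple of hops) -> last refresh time

    def refresh(self, path):
        """Called whenever an INT probe packet carrying `path` is parsed."""
        self._last_seen[tuple(path)] = time.monotonic()

    def purge_aged(self):
        """Delete aged paths and return them as suspected failures."""
        now = time.monotonic()
        aged = [p for p, ts in self._last_seen.items()
                if now - ts > self.aging_seconds]
        for p in aged:
            del self._last_seen[p]
        return aged                   # reported to the centralized controller

    def live_paths(self):
        return list(self._last_seen)
```

Paths returned by purge_aged() would both be excluded from subsequent source-routed forwarding and be reported to the centralized controller as path failure information.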
2.2 Fast reroute through source routing. In this system, an SR field is added inside a normal data packet. SR stands for Source Routing, a mechanism in which the sending end adds hop-by-hop routing information to a data packet in order to specify an explicit forwarding path; that is, the switch forwards the normal data packet entirely according to the SR information it reads, without relying on the traditional routing-table lookup.
A great advantage of forwarding data packets based on the SR mechanism is that the sending end can fully specify the forwarding path of each data packet. After the sending-end server senses a fault, subsequent traffic can therefore be rapidly rerouted onto other feasible paths by reselecting a path from the path information table, reducing the loss caused by packets dropped on the failed link.
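The following sketch illustrates fast reroute based on this mechanism: a still-live path is reselected from the path information table (reusing the PathTable sketch above) and its egress ports are encoded into the SR field. The one-byte-per-hop SR encoding and the function names are illustrative assumptions.

```python
# Minimal sketch of fast reroute via source routing: pick any live path
# from the path table and encode its per-hop egress ports as the SR field.
import random

def build_sr_field(path_hops) -> bytes:
    """Encode the egress port of each (switch, in_port, out_port) hop."""
    return bytes(out_port for (_sw, _in, out_port) in path_hops)

def choose_path_and_sr(path_table, exclude=()):
    """Reroute by selecting a live path that is not known to be affected."""
    candidates = [p for p in path_table.live_paths() if p not in exclude]
    if not candidates:
        raise RuntimeError("no feasible path left in the path information table")
    path = random.choice(candidates)
    return path, build_sr_field(path)
```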
(c) Network fault notification and positioning system
3.1 Remote notification of failure. Although fault detection and fast reroute are implemented in the data plane, fault localization is still required to repair the fault. Because the distributed servers do not share a full network topology, the failed paths observed by each server alone are not sufficient to infer the specific failure location. According to the invention, a remote centralized controller is introduced, following SDN design principles, to collect the failed-path information uploaded by each server and perform centralized network fault localization. By listening to the Redis database, changes in link information are obtained. When a change is detected, the changed information is put into a set. Considering other causes such as packet loss due to network congestion or poor link state, a link is marked as aged only after it has disappeared for a certain time. Two sets are constructed, representing the set of vanished paths and the set of repaired paths respectively. The probability of a faulty link is calculated from the data in the vanished-path set, and a repair suggestion is given accordingly. When a link is repaired (not necessarily completely), some paths are added back to the database; the added paths are put into the repaired-path set and removed from the vanished-path set. The set of suspect links is then recalculated, the link error probability and repair suggestion are given again, and repair continues. This process is repeated until the vanished-path set is empty, indicating that all paths have been repaired, i.e. the network has returned to normal.
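A minimal sketch of this controller-side bookkeeping is shown below. For simplicity it consumes abstract path presence/absence events instead of listening to the Redis database directly; the event format and the function names are assumptions made for illustration.

```python
# Minimal sketch of the controller's vanished/repaired path sets: aged
# paths enter the vanished set, reappearing paths move to the repaired
# set, and the repair loop ends when the vanished set is empty.
vanished, repaired = set(), set()

def on_path_event(path, present: bool):
    if not present:                      # the path aged out on some server
        vanished.add(path)
        repaired.discard(path)
    elif path in vanished:               # probe packets for the path are back
        vanished.discard(path)
        repaired.add(path)

def network_recovered() -> bool:
    return len(vanished) == 0
```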
3.2 Calculating fault paths through a fault tree analysis algorithm. Fault Tree Analysis (FTA) is a top-down analysis method that uses Boolean logic to model the relationship between a top event and the basic events that can cause it, and analyzes the occurrence of the top event from the basic events.
3.3 Fault tree analysis algorithm optimization. First, the cut sets are reduced to minimal cut sets, based on the following deduction: assume that the condition for event A to occur is A = L1 + L1L2 + L1L3 + L1L4 + L2 + L2L3 + L2L4 + L3L4. All 8 cut sets could be kept for A, but since L1 and L1L2 have a containment relationship, event A occurs whenever L1 occurs, regardless of whether L1L2 is satisfied. According to the absorption law in discrete mathematics, the cut sets are therefore reduced to the minimal cut sets A = L1 + L2 + L3L4. This simplification discards a portion of the fault combinations and may reduce the precision of the calculation, but it greatly improves the efficiency of the algorithm. Second, the terms with lower probability among the minimal cut sets are removed. Assuming that each link fails independently with probability 0.1, there is a significant gap between the contributions of L1, L2 and L3L4 (the probability of an error in L1 is 0.1, the probability of an error in L2 is 0.1, while the probability of an error in L3L4 is 0.1×0.1=0.01). Therefore, only the top K error paths with the highest probability need to be recommended each time, improving the efficiency of the operation and maintenance personnel. The implementation mainly traverses all error paths, selects the top K paths and recommends them to the operators, so a Top K algorithm is adopted.
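The two optimizations can be sketched as follows: the absorption law removes any cut set that is a superset of a smaller cut set, and a Top-K selection keeps only the most probable minimal cut sets. The default link failure probability and the data layout below are illustrative assumptions.

```python
# Minimal sketch of minimal-cut-set reduction (absorption law) plus Top-K
# selection of the most probable cut sets.
import heapq

def minimal_cut_sets(cut_sets):
    """Absorption law A + AB = A: discard any cut set with a strict subset present."""
    sets = [frozenset(c) for c in cut_sets]
    return [c for c in sets if not any(other < c for other in sets)]

def top_k_cut_sets(cut_sets, link_prob, k):
    """Rank minimal cut sets by joint failure probability and keep the top K."""
    def prob(cut):
        p = 1.0
        for link in cut:
            p *= link_prob.get(link, 0.1)   # assumed default link failure probability
        return p
    return heapq.nlargest(k, minimal_cut_sets(cut_sets), key=prob)

if __name__ == "__main__":
    cuts = [{"L1"}, {"L1", "L2"}, {"L1", "L3"}, {"L1", "L4"},
            {"L2"}, {"L2", "L3"}, {"L2", "L4"}, {"L3", "L4"}]
    print(top_k_cut_sets(cuts, link_prob={}, k=2))
    # minimal cut sets are L1, L2, L3L4; L3L4 (probability 0.01) ranks below L1 and L2
```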
The link error probability is calculated from the vanished paths in the database. Each link in the data center is used by multiple paths, so when a link fails, multiple paths are affected. After collecting the vanished paths, the remote centralized controller calculates the probability that each vanished path has failed, sorts the vanished paths accordingly, and gives repair suggestions. The error path probability is obtained mainly by logical operations on the links in the vanished-path set.
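One plausible way to perform this logical operation, consistent with the fact that a path fails if any of its links fails, is sketched below under an independence assumption; the concrete probabilities and helper names are illustrative only.

```python
# Minimal sketch: P(path fails) = 1 - prod(1 - p_link) over the links of
# the path, used to sort vanished paths so repair can be prioritised.
def path_failure_probability(path_links, link_prob):
    p_ok = 1.0
    for link in path_links:
        p_ok *= 1.0 - link_prob.get(link, 0.0)
    return 1.0 - p_ok

def rank_vanished_paths(vanished_paths, link_prob):
    """Sort vanished paths by failure probability, most suspect first."""
    return sorted(vanished_paths,
                  key=lambda p: path_failure_probability(p, link_prob),
                  reverse=True)

if __name__ == "__main__":
    link_prob = {"L1": 0.1, "L2": 0.1, "L3": 0.01}
    print(path_failure_probability(("L1", "L2"), link_prob))      # 0.19
    print(rank_vanished_paths([("L3",), ("L1", "L2")], link_prob))
```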
The link error location is calculated through the fault tree analysis algorithm: the error position is analyzed in a centralized way from multiple fault paths. Meanwhile, optimization is performed according to the minimal cut sets and the Top K algorithm, reducing computation cost and time.
The embodiment of the present application further provides a specific implementation of an electronic device capable of implementing all the steps of the fault detection method for a data center network based on an in-band network telemetry system in the foregoing embodiments. The electronic device comprises a processor configured to invoke a computer program in a memory, and when executing the computer program the processor implements all the steps of the fault detection method for a data center network based on an in-band network telemetry system in the foregoing embodiments; for example, the processor implements the following steps when executing the computer program:
Step 100: generating an INT detection packet based on an in-band network telemetry system, and forwarding the INT detection packet in the data center network according to a preset detection path;
step 200: analyzing the INT detection packet through a receiving end in the data center network, and storing the INT detection packet in a local database in the form of a path information table, wherein the path information table at least comprises preset ageing time of each path, and the preset ageing time is used for detecting network faults;
step 300: detecting whether the network fault exists in a server or not based on the preset aging time in the path information table;
step 400: if it is detected that the network fault exists in the server, rerouting through the source route;
step 500: and collecting fault path information uploaded by a plurality of servers, and carrying out network fault positioning in a centralized way.
The embodiments of the present application further provide a computer readable storage medium capable of implementing all the steps in the fault detection method for a data center network based on an in-band network telemetry system in the above embodiments, where the computer readable storage medium stores a computer program, and when the computer program is executed by a processor, implements all the steps in the fault detection method for a data center network based on an in-band network telemetry system in the above embodiments, for example, the processor implements the following steps when executing the computer program:
Step 100: generating an INT detection packet based on an in-band network telemetry system, and forwarding the INT detection packet in the data center network according to a preset detection path;
step 200: analyzing the INT detection packet through a receiving end in the data center network, and storing the INT detection packet in a local database in the form of a path information table, wherein the path information table at least comprises preset ageing time of each path, and the preset ageing time is used for detecting network faults;
step 300: detecting whether the network fault exists in a server or not based on the preset aging time in the path information table;
step 400: if it is detected that the network fault exists in the server, rerouting through the source route;
step 500: and collecting fault path information uploaded by a plurality of servers, and carrying out network fault positioning in a centralized way.
As can be seen from the above description, the computer readable storage medium provided in the embodiments of the present application is capable of generating an INT detection packet based on an in-band network telemetry system, and forwarding the INT detection packet in the data center network according to a preset detection path; analyzing the INT detection packet through a receiving end in the data center network, and storing it in a local database in the form of a path information table, wherein the path information table at least comprises a preset aging time of each path, and the preset aging time is used for detecting network faults; detecting whether the network fault exists in a server based on the preset aging time in the path information table; if it is detected that the network fault exists in the server, rerouting through the source route; and collecting fault path information uploaded by a plurality of servers and carrying out network fault positioning in a centralized way. The present application thereby achieves whole-network telemetry based on the topology characteristics of the data center network, adjusts traffic in time when network faults occur, and locates network faults including gray faults, thus solving the problem that multiple faults occurring simultaneously in a data center network need to be detected and located rapidly.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for a hardware+program class embodiment, the description is relatively simple, as it is substantially similar to the method embodiment, as relevant see the partial description of the method embodiment.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
Although the present application provides method operational steps as described in the examples or flowcharts, more or fewer operational steps may be included based on conventional or non-inventive labor. The order of steps recited in the embodiments is merely one way of performing the order of steps and does not represent a unique order of execution. When implemented by an actual device or client product, the instructions may be executed sequentially or in parallel (e.g., in a parallel processor or multi-threaded processing environment) as shown in the embodiments or figures.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a car-mounted human-computer interaction device, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or nonvolatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, Phase-change Memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, Compact Disc Read Only Memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium which can be used to store information that can be accessed by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It will be appreciated by those skilled in the art that embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the present specification embodiments may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects.
The present embodiments may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The embodiments of the specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for system embodiments, since they are substantially similar to method embodiments, the description is relatively simple, as relevant to see a section of the description of method embodiments. In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the embodiments of the present specification. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
The foregoing is merely an example of the present specification and is not intended to limit the present specification. Various modifications and variations of the illustrative embodiments will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, or the like, which is within the spirit and principles of the embodiments of the present specification, should be included in the scope of the claims of the embodiments of the present specification.

Claims (8)

1. A method for detecting a failure of a data center network based on an in-band network telemetry system, comprising:
generating an INT detection packet based on an in-band network telemetry system, and forwarding the INT detection packet in the data center network according to a preset detection path;
analyzing the INT detection packet through a receiving end in the data center network, and storing the INT detection packet in a local database in the form of a path information table, wherein the path information table at least comprises preset ageing time of each path, and the preset ageing time is used for detecting network faults;
detecting whether the network fault exists in a server or not based on the preset aging time in the path information table; obtaining a repair evaluation result according to a network fault positioning strategy, wherein the network fault at least comprises a gray fault; detecting the network faults in a data center according to a preset probability model, a preset error scale and a preset K value of a Top K algorithm; the preset probability model comprises a first probability model in which different error probabilities are set for links according to their conditions, and a second probability model in which all links have a unified error probability; the preset error scale is expressed as a ratio for different error scales, wherein the error scale ratio = the number of faulty links / the total number of links in the topology; the K value of the Top K algorithm is preset, and the K value is determined according to the number of fault repairing rounds, the time consumed by each round and the accuracy;
If it is detected that the network failure exists in the server, rerouting through the source route, the rerouting through the source route comprising: adding a source route field into a data packet of the data center network; determining a path for forwarding the data packet according to the source route field; when the network fault exists in the server, deleting the aged path, updating the path information table, and then recovering the link;
and collecting fault path information uploaded by a plurality of servers, and carrying out network fault positioning in a centralized way.
2. The method of claim 1, wherein the detecting whether the network failure exists in a server based on the preset aging time in the path information table comprises:
detecting whether the INT detection packet is received by the receiving end or not based on the preset aging time of the path in the path information table;
if it is detected that the INT detection packet is not received by the receiving end, deleting the path and determining that the network fault exists in the server, wherein the network fault at least comprises one of the following: a link failure in which the detection packet cannot reach the receiving end due to link congestion, and a link failure in which the detection packet cannot reach the receiving end and the path stored on the receiving end is aged.
3. The method according to claim 1, wherein the message format of the INT probe packet includes an ethernet header and an IP header, and the information format of the INT probe packet includes a switch ID, an ingress port number, and an egress port number.
4. The method of claim 1, wherein the centralized network fault location by collecting fault path information uploaded by a plurality of servers comprises:
obtaining link information change by monitoring a Redis database, and placing the link information with the change into a target set, wherein the target set comprises a set of vanishing paths and a set of repairing paths;
and calculating the probability of the error link according to the data in the vanishing path set.
5. The method as recited in claim 4, further comprising:
calculating the fault path through a fault tree analysis algorithm;
calculating the error position in the link through a fault tree analysis algorithm;
and analyzing and positioning the error position in a centralized manner through the fault tree analysis algorithm according to a plurality of fault paths, and performing optimization in the fault tree analysis process according to a minimal cut set and a Top K algorithm.
6. A data center network fault detection device based on an in-band network telemetry system, comprising:
the INT detection module is used for generating an INT detection packet based on an in-band network telemetry system and forwarding the INT detection packet in the data center network according to a preset detection path;
the analyzing module is used for analyzing the INT detection packet through a receiving end in the data center network and storing the INT detection packet in a local database in a form of a path information table, wherein the path information table at least comprises preset ageing time of each path, and the preset ageing time is used for detecting network faults;
the network fault detection module is used for detecting whether the network fault exists in the server or not based on the preset aging time in the path information table; obtaining a repair evaluation result according to a network fault positioning strategy, wherein the network fault at least comprises a gray fault; detecting the network faults in a data center according to a preset probability model, a preset error scale and a preset K value of a Top K algorithm; the preset probability model comprises a first probability model in which different error probabilities are set for links according to their conditions, and a second probability model in which all links have a unified error probability; the preset error scale is expressed as a ratio for different error scales, wherein the error scale ratio = the number of faulty links / the total number of links in the topology; the K value of the Top K algorithm is preset, and the K value is determined according to the number of fault repairing rounds, the time consumed by each round and the accuracy;
The rerouting module is used for rerouting through the source route when the network fault exists in the server, and adding a source route field into a data packet of the data center network; determining a path for forwarding the data packet according to the source routing field; when the network fault exists in the server, deleting the aged path, updating the path information table, and then recovering a link;
and the fault positioning module is used for intensively positioning network faults by collecting fault path information uploaded by a plurality of servers.
7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor performs the steps of the method for fault detection of a data center network based on an in-band network telemetry system as claimed in any one of claims 1 to 5.
8. A computer readable storage medium having stored thereon a computer program, which when executed by a processor, implements the steps of the method for fault detection of a data center network based on an in-band network telemetry system as claimed in any one of claims 1 to 5.
CN202111027721.8A 2021-09-02 2021-09-02 Data center network fault detection method and device based on in-band network telemetry system Active CN113938407B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111027721.8A CN113938407B (en) 2021-09-02 2021-09-02 Data center network fault detection method and device based on in-band network telemetry system

Publications (2)

Publication Number Publication Date
CN113938407A CN113938407A (en) 2022-01-14
CN113938407B true CN113938407B (en) 2023-06-20

Family

ID=79275037

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111027721.8A Active CN113938407B (en) 2021-09-02 2021-09-02 Data center network fault detection method and device based on in-band network telemetry system

Country Status (1)

Country Link
CN (1) CN113938407B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114760225A (en) * 2022-03-31 2022-07-15 深信服科技股份有限公司 Fault diagnosis method, system and storage medium
CN114866431A (en) * 2022-04-28 2022-08-05 深圳智芯微电子科技有限公司 Method and device for predicting SFC network fault based on INT and processor
CN114844708B (en) * 2022-05-07 2024-06-18 长三角信息智能创新研究院 Method, equipment and storage medium for relieving flooding attack based on traffic rerouting link
CN115396355A (en) * 2022-08-25 2022-11-25 北京有竹居网络技术有限公司 Network path detection method and device and electronic equipment
CN116962143B (en) * 2023-09-18 2024-01-26 腾讯科技(深圳)有限公司 Network fault detection method, device, computer equipment and storage medium
CN117411806B (en) * 2023-12-13 2024-03-08 国网浙江省电力有限公司信息通信分公司 Power communication network performance evaluation method, system, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8989194B1 (en) * 2012-12-18 2015-03-24 Google Inc. Systems and methods for improving network redundancy and for facile initialization in a centrally-controlled network
CN109787833A (en) * 2019-01-23 2019-05-21 清华大学 Network exception event cognitive method and system
WO2019239189A1 (en) * 2018-06-13 2019-12-19 Telefonaktiebolaget Lm Ericsson (Publ) Robust node failure detection mechanism for sdn controller cluster
CN113271225A (en) * 2021-05-18 2021-08-17 浙江大学 Network reliability evaluation method based on in-band network telemetry technology

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105281945B (en) * 2014-09-19 2020-04-07 中国人民解放军第二炮兵工程大学 Deterministic network integrity fault detection method based on data flow
CN108199924B (en) * 2018-01-26 2020-02-18 北京邮电大学 Whole network flow visualization method and device based on in-band network telemetry
CN110224883B (en) * 2019-05-29 2020-11-27 中南大学 Gray fault diagnosis method applied to telecommunication bearer network
CN111581036B (en) * 2020-03-31 2022-04-15 西安电子科技大学 Internet of things fault detection method, detection system and storage medium
CN112422498B (en) * 2020-09-04 2023-04-14 网络通信与安全紫金山实验室 In-band network remote measuring method, system and computer readable storage medium
CN112866075B (en) * 2020-12-21 2023-03-24 网络通信与安全紫金山实验室 In-band network telemetering method, system and related device for Overlay network
CN112702330B (en) * 2020-12-21 2022-07-01 网络通信与安全紫金山实验室 Lightweight in-band network telemetry method and device for Overlay network and storage medium

Also Published As

Publication number Publication date
CN113938407A (en) 2022-01-14

Similar Documents

Publication Publication Date Title
CN113938407B (en) Data center network fault detection method and device based on in-band network telemetry system
US11575559B1 (en) Monitoring and detecting causes of failures of network paths
CN111147287B (en) Network simulation method and system in SDN scene
US8661295B1 (en) Monitoring and detecting causes of failures of network paths
CN109314652B (en) Network performance measurement method and device
US9712381B1 (en) Systems and methods for targeted probing to pinpoint failures in large scale networks
US9369360B1 (en) Systems and methods for fault detection in large scale networks
KR20170049509A (en) Collecting and analyzing selected network traffic
CN102868553B (en) Fault Locating Method and relevant device
CN110601888A (en) Deterministic fault detection and positioning method and system in time-sensitive network
CN108449210B (en) Network routing fault monitoring system
EP3222003B1 (en) Inline packet tracing in data center fabric networks
US20220052923A1 (en) Data processing method and device, storage medium and electronic device
CN105721184A (en) Network link quality monitoring method and apparatus
US11349703B2 (en) Method and system for root cause analysis of network issues
CN111404822B (en) Data transmission method, device, equipment and computer readable storage medium
CN106789625A (en) A kind of loop detecting method and device
CN109245961A (en) Link-quality detection method, device, storage medium and equipment
CN104283780A (en) Method and device for establishing data transmission route
CN110071843B (en) Fault positioning method and device based on flow path analysis
Bouillard et al. Hidden anomaly detection in telecommunication networks
EP3718261B1 (en) System for network event detection and analysis
CN116909817A (en) Dedicated line control method, device, computer equipment and storage medium
CN115955690A (en) Wireless signal strength based detection of poor network link performance
CN110830327B (en) Method for realizing process layer network full link monitoring and alarming

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant