CN113938407A - Data center network fault detection method and device based on in-band network telemetry system

Info

Publication number
CN113938407A
Authority
CN
China
Prior art keywords: network, fault, path, INT, data center
Prior art date
Legal status (an assumption, not a legal conclusion)
Granted
Application number
CN202111027721.8A
Other languages
Chinese (zh)
Other versions
CN113938407B (en)
Inventor
潘恬
贾晨昊
许凯
宋恩格
张志龙
黄韬
刘韵洁
Current Assignee (the listed assignees may be inaccurate)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (an assumption, not a legal conclusion)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN202111027721.8A
Publication of CN113938407A
Application granted
Publication of CN113938407B
Legal status: Active

Classifications

    • H04L 43/0823: Monitoring or testing based on specific metrics; errors, e.g. transmission errors
    • H04L 41/0677: Management of faults, events, alarms or notifications; localisation of faults
    • H04L 41/12: Discovery or management of network topologies
    • H04L 43/12: Network monitoring probes
    • H04L 45/28: Routing or path finding of packets using route fault recovery
    • H04L 45/34: Source routing
    • H04Q 9/00: Arrangements in telecontrol or telemetry systems for selectively calling a substation from a main station to obtain measured values
    • Y02D 30/50: Reducing energy consumption in wire-line communication networks, e.g. low power modes or reduced link rate

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Environmental & Geological Engineering (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

Embodiments of the present application provide a fault detection method and related device for a data center network based on an in-band network telemetry (INT) system. The method comprises the following steps: generating an INT detection packet based on the in-band network telemetry system, and forwarding the INT detection packet through the data center network along a preset detection path; parsing the INT detection packet at a receiving end in the data center network, and storing it in a local database in the form of a path information table; detecting, based on a preset aging time in the path information table, whether a network fault exists in a server; if a network fault exists in the server, rerouting traffic via source routing; and performing centralized network fault location by collecting the fault path information uploaded by a plurality of servers. The method and device address the need to rapidly detect and locate multiple faults that occur simultaneously in a data center network.

Description

Data center network fault detection method and device based on in-band network telemetry system
Technical Field
The present application relates to the field of data center network fault detection, and in particular to a data center network fault detection method based on an in-band network telemetry system and a related device.
Background
Data centers play a vital role in today's information acquisition, dissemination, computation, storage and management. To meet ever-increasing and changing user demands, data centers increasingly need the support of larger-scale, higher-density Data Center Networks (DCNs).
Among network faults in the related art, gray failures are particularly complex and destructive. Gray failures often occur not at a single location but at multiple locations throughout the data center network.
For the problem in the related art that multiple faults occur simultaneously in a data center network and must be rapidly detected and located, no effective solution has yet been proposed.
Disclosure of Invention
To address these problems in the prior art, the present application provides a method and a device for detecting faults of a data center network based on an in-band network telemetry system, which can solve the technical problem that multiple faults occurring simultaneously in a data center network must be rapidly detected and located.
In order to solve at least one of the above problems, the present application provides the following technical solutions:
In a first aspect, the present application provides a method for detecting faults of a data center network based on an in-band network telemetry system, including: generating an INT detection packet based on the in-band network telemetry system, and forwarding the INT detection packet through the data center network along a preset detection path; parsing the INT detection packet at a receiving end in the data center network, and storing it in a local database in the form of a path information table, wherein the path information table at least comprises a preset aging time for each path, and the preset aging time is used for detecting network faults; detecting, based on the preset aging time in the path information table, whether a network fault exists in a server; if the network fault exists in the server, performing rerouting via source routing; and performing centralized network fault location by collecting the fault path information uploaded by a plurality of servers.
Further, the performing rerouting via source routing includes: adding a source routing field into a data packet of the data center network; determining the path along which the data packet is forwarded according to the source routing field; and when the network fault exists in the server, deleting the aged path, updating the path information table, and then restoring the link.
Further, the detecting whether the network fault exists in the server based on the preset aging time in the path information table includes:
detecting whether the INT detection packet is received by the receiving end or not based on the preset aging time of the path in the path information table;
if it is detected that the INT detection packet has not been received by the receiving end, deleting the path and considering that the network fault exists in the server, wherein the network fault at least comprises one of the following cases: a link failure in which the detection packet cannot reach the receiving end due to link congestion, and a link failure in which the detection packet cannot reach the receiving end and the path stored on the receiving end is aged.
Further, after detecting whether the network fault exists in the server based on the preset aging time in the path information table, the method further includes: obtaining a repair evaluation result according to a network fault location strategy, wherein the network fault at least comprises a gray fault; and detecting the network fault in the data center according to a preset probability model, a preset error scale, and a K value of a preset Top K algorithm. The preset probability model comprises a first probability model and a second probability model, wherein the first probability model sets a different error occurrence probability for each link according to the link's condition, and the second probability model sets a unified error occurrence probability for all links. The preset error scale is the ratio of different error scales, where the error-scale ratio is the number of faulty links divided by the total number of links in the topology. The K value of the preset Top K algorithm is determined according to the number of fault repair rounds, the time consumed per round, and the accuracy.
Further, the INT detection packet adopts a message format comprising an Ethernet header and an IP header, and the information collected by the INT detection packet comprises a switch ID, an ingress port number, and an egress port number.
Further, the performing centralized network fault location by collecting fault path information uploaded by the plurality of servers includes: monitoring a Redis database to obtain changes in link information, and putting the changed link information into target sets, wherein the target sets comprise a disappeared-path set and a repaired-path set; and calculating the probability of faulty links according to the data in the disappeared-path set.
Further, the method further includes: calculating the fault paths through a fault tree analysis algorithm; calculating the faulty positions in the links through the fault tree analysis algorithm; analyzing and locating the faulty positions in a centralized manner through the fault tree analysis algorithm over a plurality of fault paths; and optimizing the fault tree analysis using minimal cut sets and a Top K algorithm.
In a second aspect, the present application provides a fault detection device for a data center network based on an in-band network telemetry system, comprising: an INT detection module, configured to generate an INT detection packet based on the in-band network telemetry system and forward the INT detection packet through the data center network along a preset detection path; a parsing module, configured to parse the INT detection packet at a receiving end in the data center network and store it in a local database in the form of a path information table, wherein the path information table at least comprises a preset aging time for each path, and the preset aging time is used for detecting network faults; a network fault detection module, configured to detect, based on the preset aging time in the path information table, whether the network fault exists in the server; a rerouting module, configured to perform rerouting via source routing when detecting that the network fault exists in the server; and a fault location module, configured to perform centralized network fault location by collecting the fault path information uploaded by a plurality of servers.
In a third aspect, the present application provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the method for detecting a failure in a data center network based on an in-band network telemetry system when executing the computer program.
In a fourth aspect, the present application provides a computer-readable storage medium having stored thereon a computer program, which when executed by a processor, performs the steps of the method for fault detection for an in-band network telemetry system-based data center network.
According to the above technical solutions, the present application provides a data center network fault detection method and related device based on an in-band network telemetry system: an INT detection packet is generated based on the in-band network telemetry system and forwarded through the data center network along a preset detection path; the INT detection packet is parsed at a receiving end in the data center network and stored in a local database in the form of a path information table, wherein the path information table at least comprises a preset aging time for each path, used for detecting network faults; whether a network fault exists in a server is detected based on the preset aging time in the path information table; if the network fault exists in the server, rerouting is performed via source routing; and centralized network fault location is performed by collecting the fault path information uploaded by a plurality of servers. The present application thus realizes whole-network telemetry based on the topological characteristics of the data center network, adjusts traffic in time when network faults occur, and locates network faults including gray faults, thereby solving the problem that multiple simultaneous faults in a data center network must be rapidly detected and located.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings in the following description show some embodiments of the present application; other drawings can be obtained from them by those skilled in the art without creative effort.
Fig. 1 is a system architecture diagram of a fault detection method for a data center network based on an in-band network telemetry system in an embodiment of the present application.
Fig. 2 is a schematic flowchart of a fault detection method for a data center network based on an in-band network telemetry system in an embodiment of the present application.
Fig. 3 is a schematic structural diagram of a fault detection apparatus of a data center network based on an in-band network telemetry system in an embodiment of the present application.
Fig. 4 is a schematic view of a Fat-tree architecture topology structure of a fault detection method for a data center network based on an in-band network telemetry system in an embodiment of the present application.
Fig. 5 is a schematic diagram of an INT detection packet format of a fault detection method for a data center network based on an in-band network telemetry system in an embodiment of the present application.
Fig. 6 is a schematic diagram of an INT detection process of a fault detection method for a data center network based on an in-band network telemetry system in an embodiment of the present application.
Fig. 7 is a schematic diagram of a general data packet format of a fault detection method for a data center network based on an in-band network telemetry system in an embodiment of the present application.
Fig. 8 is a schematic diagram of a general packet forwarding flow of a method for detecting a failure in a data center network based on an in-band network telemetry system in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings. The described embodiments are some, but not all, of the embodiments of the present application. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present application.
The inventor has found that, in the related art, a remote centralized controller plans a plurality of communication paths according to the network topology, the servers send TCP or HTTP pings, and whether the corresponding link is normal is inferred from the connection establishment time. Although this approach can detect most network faults, its limitation is that it is difficult to accurately locate the faulty device in gray-failure scenarios, unless the traceroute technique is further used, i.e., the ping packet reports its current status to the remote centralized controller every time it passes through a switching device.
In other related technologies, a server sends probe packets through an IP-in-IP tunnel, and each probe packet returns along its original path after reaching the destination switch. After the probe packets are received, a learning algorithm estimates, from the ratio of received to sent probe packets, which link in the network is most likely to be the location of the network fault. However, because this approach only detects link on-off failures, some gray failures, such as packet loss for a specific destination IP, cannot be detected. Moreover, since a machine-learning algorithm is used to guess the fault location, zero false positives and zero false negatives cannot be guaranteed theoretically.
Considering the problem that multiple faults occur simultaneously in a data center network and must be rapidly detected and located, the present application provides a fault detection method for a data center network based on an in-band network telemetry system, which realizes whole-network telemetry based on the topological characteristics of the data center network, adjusts traffic in time when network faults occur, and locates network faults including gray faults.
Referring to fig. 1, a schematic diagram of the system architecture of the fault detection method for a data center network based on an in-band network telemetry system in an embodiment of the present application is shown. The system specifically includes: an in-band whole-network telemetry system, a network fault detection and fast rerouting system, and a network fault reporting and location system.
The server periodically sends INT (In-band Network Telemetry) detection packets to check whether all paths between a source end and a destination end are feasible. After passing through each switch, the INT detection packet records the corresponding switch ID, ingress port number and egress port number, and the switch multicasts the detection packet according to certain forwarding rules so as to cover all feasible paths.
Each probe packet receiving-end server stores the feasible paths in a path information table, and each path is accompanied by an aging time. According to the path information table, the server can send data packets along a specified feasible path through Source Routing (SR). When a network failure occurs, since probe packets cannot pass through the failed link, the relevant paths in the path information table will age out, and these paths will no longer be selected for sending traffic. In addition, the aged paths are reported to the remote centralized controller as path failure information.
Since the path failure information on a single server is not sufficient for accurate network failure localization, the controller needs to receive path failure information from all affected servers and perform centralized analysis. More specifically, the controller needs to find the common point between these aged paths until the failure converges to a certain link between two devices.
Referring to fig. 2, the present application provides an embodiment of a method for detecting faults of a data center network based on an in-band network telemetry system, which specifically includes the following steps:
step S201, generating an INT detection packet based on an in-band network telemetry system, and forwarding the INT detection packet in the data center network according to a preset detection path;
step S202, analyzing the INT detection packet through a receiving end in the data center network, and storing the INT detection packet in a local database according to a form of a path information table, wherein the path information table at least comprises preset aging time of each path, and the preset aging time is used for detecting network faults;
step S203, detecting whether the network fault exists in the server based on the preset aging time in the path information table;
step S204, if the network fault exists in the server, rerouting is carried out through the source route;
step S205, performing centralized network fault location by collecting the fault path information uploaded by a plurality of servers.
As can be seen from the above description, an INT detection packet is generated based on the in-band network telemetry system and forwarded through the data center network along a preset detection path; the INT detection packet is parsed at a receiving end in the data center network and stored in a local database in the form of a path information table, wherein the path information table at least comprises a preset aging time for each path, used for detecting network faults; whether a network fault exists in a server is detected based on the preset aging time in the path information table; if the network fault exists in the server, rerouting is performed via source routing; and centralized network fault location is performed by collecting the fault path information uploaded by a plurality of servers. The present application thus realizes whole-network telemetry based on the topological characteristics of the data center network, adjusts traffic in time when network faults occur, and locates network faults including gray faults, solving the problem that multiple simultaneous faults in a data center network must be rapidly detected and located.
In step S201, In-band Network Telemetry (INT) is a fine-grained network measurement architecture mainly used to collect real-time network link states in the data plane without excessive control-plane participation. In the INT model, a sending end sends special INT probe packets that traverse the whole network; as a probe packet passes along its probe path through devices with INT information collection capability, each device inserts the corresponding INT information into the packet. The INT probe packets are finally collected and processed by a remote centralized controller and used to implement whole-network congestion control, load balancing, network fault detection, and the like.
Because INT can detect the state of network links, it can be used to monitor link connectivity in real time, and gray-fault discovery and localization can be performed on that basis.
In order for INT probe packets to cover the entire network, specific forwarding rules need to be set in the switches according to the specific network topology. Following prior full-network probing schemes for data center networks, for example in the three-layer Fat-tree data center network shown in fig. 4, the forwarding rules set in the switches are as follows:
for the ToR switch, when it receives the probe packet sent from the server, the probe packet will be forwarded to all Leaf switches connected to it (such as fig. 4, Sever1 → T1 → L1 and L2). When it receives a probe packet from a Leaf switch, the probe packet will be forwarded to all servers connected to it (e.g., fig. 4, L3 → T3 → Server5 and Server 6).
For a Leaf switch, when it receives a probe packet sent from a ToR switch, the probe packet will be forwarded to all ports except the ingress port (e.g., fig. 4, T1 → L1 → S1, S2, and T2). When it receives a probe packet from a Spine switch, the probe packet will be forwarded to all ToR switches connected to it (e.g., fig. 4, S1 → L3 → T3 and T4).
For a Spine switch, when it receives a probe packet sent from a Leaf switch, the probe packet will be sent to all ports except the ingress port (e.g., fig. 4, L1 → S1 → L3).
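The following minimal sketch, in Python, summarizes these loop-free multicast rules. The role names, port-set parameters, and function name are illustrative assumptions, not part of the patent:

    def probe_egress_ports(role, ingress_port, down_ports, up_ports):
        """Return the set of egress ports for an INT probe packet.

        role       : "tor", "leaf" or "spine"
        down_ports : ports toward servers (ToR), ToR switches (Leaf), or Leaf switches (Spine)
        up_ports   : ports toward Leaf switches (ToR) or Spine switches (Leaf); empty for Spine
        """
        if role == "tor":
            # From a server: flood to all Leaf switches; from a Leaf: flood to all servers.
            return set(up_ports) if ingress_port in down_ports else set(down_ports)
        if role == "leaf":
            if ingress_port in down_ports:  # from a ToR: all ports except the ingress port
                return (set(down_ports) | set(up_ports)) - {ingress_port}
            return set(down_ports)          # from a Spine: only toward ToR switches
        if role == "spine":
            return set(down_ports) - {ingress_port}  # from a Leaf: all other Leaf-facing ports
        raise ValueError(role)

    # Example: ToR T1 receives a probe from Server1 on port 1 (fig. 4, Server1 -> T1 -> L1 and L2).
    print(probe_egress_ports("tor", 1, down_ports=[1, 2], up_ports=[3, 4]))  # {3, 4}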
As a preferred embodiment of the present application, the INT probe packet adopts a message format comprising an Ethernet header and an IP header, and the information collected by the probe packet comprises a switch ID, an ingress port number, and an egress port number.
The INT probe packets are generated by servers. In a Fat-tree data center network, the servers attached to the same ToR switch share the same path information, so not every server needs to send INT probe packets; it suffices to designate one server under each ToR switch as the INT probe packet sender.
Further, probe packet processing is performed on the switch side. In the standard INT framework, a device supporting INT can provide a variety of link state information, including switch ID, ingress and egress port numbers, queue depth, forwarding latency, and so on. Since the present application mainly detects and locates link faults, the INT probe packet is simplified so that it occupies as little network bandwidth as possible while still providing the required functions. The simplified INT probe packet collects the following hardware information:
switch_id (8 bits): the switch ID, assigned by the remote centralized controller; each switch's ID is unique.
ingress_port (8 bits): the port through which the probe packet entered the switch.
egress_port (8 bits): the port through which the probe packet leaves the switch.
In a specific implementation, as shown in fig. 5, the initial probe packet generated by the sending-end server consists of an Ethernet header and an IP header. The protocol field in the IP header is 0x700, indicating that the packet is a simplified INT probe packet. As the probe packet traverses the network as shown in fig. 6, each switch it passes through appends the required INT information behind the IP header of the probe packet; that is, the INT information in the probe packet is organized as a stack.
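As a sketch only, the per-hop stacking and receiving-end parsing of the simplified 3-byte INT entries might look as follows in Python; the byte layout beyond the three named fields is an assumption:

    import struct

    INT_ENTRY = struct.Struct("!BBB")  # switch_id, ingress_port, egress_port: 8 bits each

    def push_int_entry(int_stack, switch_id, in_port, out_port):
        """What each INT-capable switch does: push its entry onto the INT stack."""
        return INT_ENTRY.pack(switch_id, in_port, out_port) + int_stack

    def parse_int_stack(int_stack):
        """Receiving end: recover the hop records from the stacked INT entries."""
        return [INT_ENTRY.unpack_from(int_stack, off)
                for off in range(0, len(int_stack), INT_ENTRY.size)]

    # Example: a probe traverses switches 1 -> 5 -> 9; the last hop ends up on top.
    stack = b""
    for sw, inp, outp in [(1, 1, 3), (5, 1, 4), (9, 1, 2)]:
        stack = push_int_entry(stack, sw, inp, outp)
    print(parse_int_stack(stack))  # [(9, 1, 2), (5, 1, 4), (1, 1, 3)]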
In step S202, the INT probe packet is parsed by the receiving end in the data center network and stored in a local database in the form of a path information table. That is, after the receiving end receives a probe packet, it parses the INT information carried in the packet and stores it in the local database in the form of a path information table.
It should be noted that the path information table at least contains a preset aging time for each path, and the preset aging time is used for detecting network faults. For example, the path information table stored in server A contains all feasible paths from server A to all other reachable servers. Each path is represented by the sequence of switch IDs and corresponding ingress and egress port numbers traversed along the way.
In step S203, whether the network fault exists in the server is detected based on the preset aging time in the path information table; that is, network faults, including gray faults, are discovered through path aging.
In a specific implementation, each path in the path information table has an aging time: if no new probe packet arrives within a certain time to refresh a path's state, that path is deleted. Therefore, if a link failure exists in the network, the related probe packets cannot reach the receiving end, the related paths stored on the receiving end age out, and the server thus judges that a fault exists somewhere on the affected paths. It should be noted that when a link is congested, probe packets may fail to reach the receiving end for a short time, and the server may mistakenly conclude that the path has failed. Therefore, in actual operation, the aging time of paths in the path information table can be increased appropriately to filter out transient link congestion. Very severe congestion may not be filtered out even by increasing the aging time, but in that case the link can reasonably be treated as failed and handled accordingly.
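A minimal sketch of the receiving-end path information table and its aging sweep; the names PathTable and AGING_TIME and the concrete timing are assumptions for illustration:

    import time

    AGING_TIME = 3.0  # seconds; set generously to filter transient congestion

    class PathTable:
        def __init__(self):
            self.paths = {}  # tuple of (switch_id, in_port, out_port) hops -> last refresh time

        def refresh(self, hops):
            """Called whenever a probe packet carrying this path arrives."""
            self.paths[tuple(hops)] = time.time()

        def sweep(self):
            """Delete aged paths and return them as suspected-failure reports."""
            now = time.time()
            aged = [p for p, seen in self.paths.items() if now - seen > AGING_TIME]
            for p in aged:
                del self.paths[p]  # aged paths are no longer usable for sending traffic
            return aged            # to be uploaded to the remote centralized controller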
In step S204, if the network fault is detected in the server, rerouting is performed via source routing, which enables fast reroute.
In a specific implementation, as shown in fig. 7, the IP header protocol field of a normal traffic packet is 0x701, and the SR information indicating the next hops, such as port numbers, is queued behind the IP header. As shown in fig. 8, when the packet passes through a switch, the switch forwards it to the egress port designated by the first (i.e., leftmost) SR entry it parses, and deletes that entry, ensuring that used SR information is not reused by subsequent switches. When a link failure is detected, the path information table in the server is updated: the affected paths are deleted by aging, and no subsequent data packet will traverse the failed link, so fast rerouting of subsequent traffic is achieved within one aging period.
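A sketch of this switch-side pop-and-forward behavior; the list representation of the SR field is an assumption:

    def sr_forward(sr_stack, payload):
        """sr_stack: the queue of egress port numbers carried behind the IP header."""
        egress_port = sr_stack[0]  # first (leftmost) SR entry designates the egress port
        remaining = sr_stack[1:]   # delete it so downstream switches do not reuse it
        return egress_port, remaining, payload

    egress, rest, _ = sr_forward([3, 1, 2], b"data")
    # This switch sends the packet out port 3; the next switch will see [1, 2].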
In step S205 described above, although fault detection and fast reroute have been implemented in the data plane, fault location is still required in order to repair the fault. Since distributed servers share no full network topology, the fault paths observed by any single server are insufficient to infer a specific fault location. Therefore, in the embodiment of the present application, a remote centralized controller is introduced based on SDN design specifications to collect the fault path information uploaded by each server, and network fault location is performed centrally using the fault path information uploaded by a plurality of servers.
Based on the above steps, whole-network telemetry based on the topological characteristics of the data center network is realized. Since gray failures may occur on any port of any device in the DCN, and possibly at more than one location, full-network telemetry must cover every port of every device. In a data center network, however, simple broadcasting introduces loops. The present application therefore studies the structure of the data center network and adopts a topology-based multicast telemetry scheme that covers the whole network without loops.
Based on the above steps, traffic is adjusted in time when network faults occur. Even if a network operator or end user notices abnormal traffic behavior, it usually takes a long time to locate a gray failure before traffic can be redeployed to bypass the faulty area, because gray failures cause a large amount of silent packet loss. The longer fault location takes, the greater the traffic loss; moreover, the lost traffic triggers packet retransmissions on the end hosts, causing network congestion. The present application provides a strategy in which traffic is adjusted immediately once the data plane senses a fault, without waiting for fault location to complete.
Based on the above steps, network fault location including gray faults is realized. Through whole-network telemetry and local rerouting decisions, the data plane can directly adjust traffic to avoid the faulty area; in addition, the network operator needs to know the specific location of the fault in order to repair it completely. The present application provides a network fault location strategy covering gray faults based on the whole-network telemetry results, and gives repair suggestions. Preferably, the present application is not limited to the case of a single faulty location: faults at multiple locations in the data center network can still be detected and located.
As a preference in this embodiment, the performing rerouting via source routing further includes: adding a source routing field into a data packet of the data center network; determining the path along which the data packet is forwarded according to the source routing field; and when the network fault exists in the server, deleting the aged path, updating the path information table, and then restoring the link.
In a specific implementation, an SR field is added to normal data packets in this system. SR (Source Routing) is a mechanism in which hop-by-hop routing information is attached to a packet at the sending end to specify the exact path along which the packet is forwarded; that is, the switch forwards a normal data packet entirely according to the SR information it reads, without relying on conventional routing table lookup.
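A sender-side sketch of fast reroute, reusing the PathTable sketch above; destination_of and the hop-tuple layout are illustrative assumptions:

    def destination_of(path_hops):
        # Assumption: the last hop's (switch_id, egress_port) identifies the
        # destination server's attachment point.
        return (path_hops[-1][0], path_hops[-1][2])

    def build_sr_stack(path_hops):
        """Encode a path from the path table as the SR list of egress ports."""
        return [egress for (_sw, _in, egress) in path_hops]

    def choose_path(path_table, dst):
        # Any path still in the table was recently verified by an INT probe;
        # aged (suspected-faulty) paths have already been deleted.
        candidates = [p for p in path_table.paths if destination_of(p) == dst]
        return candidates[0] if candidates else None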
As a preference in this embodiment, the detecting whether the network fault exists in the server based on the preset aging time in the path information table includes: detecting, based on the preset aging time of each path in the path information table, whether the INT probe packet has been received by the receiving end; and if it is detected that the INT probe packet has not been received by the receiving end, deleting the path and considering that the network fault exists in the server, wherein the network fault at least comprises one of the following cases: a link failure in which the probe packet cannot reach the receiving end due to link congestion, and a link failure in which the probe packet cannot reach the receiving end and the path stored on the receiving end is aged.
In a specific implementation, each path in the path information table has an aging time: if no new probe packet arrives within a certain time to refresh a path's state, that path is deleted. Therefore, if a link failure exists in the network, the related probe packets cannot reach the receiving end, the related paths stored on the receiving end age out, and the server thus judges that a fault exists somewhere on the affected paths. It should be noted that when a link is congested, probe packets may fail to reach the receiving end for a short time, and the server may mistakenly conclude that the path has failed. Therefore, in actual operation, the aging time of paths in the path information table can be increased appropriately to filter out transient link congestion. Very severe congestion may not be filtered out even by increasing the aging time, but in that case the link can reasonably be treated as failed and handled accordingly.
As a preference in this embodiment, after detecting whether the network fault exists in the server based on the preset aging time in the path information table, the method further includes: obtaining a repair evaluation result according to a network fault location strategy, wherein the network fault at least comprises a gray fault; and detecting the network fault in the data center according to a preset probability model, a preset error scale, and a K value of a preset Top K algorithm. The preset probability model comprises a first probability model and a second probability model, wherein the first probability model sets a different error occurrence probability for each link according to the link's condition, and the second probability model sets a unified error occurrence probability for all links. The preset error scale is the ratio of different error scales, where the error-scale ratio is the number of faulty links divided by the total number of links in the topology. The K value of the preset Top K algorithm is determined according to the number of fault repair rounds, the time consumed per round, and the accuracy.
In a specific implementation, network faults including gray faults need to be located. Three groups of variables are set: different probability models, different error scales, and different K values (for the Top K algorithm).
(1) Probabilistic model
Two probability models are set according to link conditions. In the first, a different error occurrence probability is set for each link according to its condition. Investigation shows that, because links differ in material and in traffic scale, some links in a data center are more prone to failure and aging than others; the probability set for each link when computing link errors is therefore different. Setting differentiated probabilities according to link characteristics brings the system closer to the structure of an actual data center network. Because simulation software is used here, a different probability is randomly generated for each link; in an actual data center network, a link error model can be built from a large amount of data so that the error probability of each link is measured statistically, yielding more accurate results. In the second model, the error probability of all links is unified: unlike the first case, differences in per-link error probability are ignored and the error probability of every link is set to P, i.e., every link fails and ages with equal probability. This idealized setting makes it easier to verify the functionality of the system from a quantitative perspective. In both cases, the number of fault repair rounds, the time consumed per round, and the accuracy are measured, and the influence of the different probability models on data center network fault detection is studied by comparing these indicators.
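A sketch of setting up the two probability models; the probability ranges and values are illustrative assumptions:

    import random

    def model1_per_link(links, low=0.01, high=0.2, seed=0):
        """Model 1: a different (here randomly generated) error probability per link,
        standing in for statistics measured on a real data center network."""
        rng = random.Random(seed)
        return {link: rng.uniform(low, high) for link in links}

    def model2_uniform(links, P=0.1):
        """Model 2: every link fails and ages with the same probability P."""
        return {link: P for link in links}

    links = ["L1", "L2", "L3", "L4"]
    print(model1_per_link(links))
    print(model2_uniform(links))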
(2) Error scale
In order to evaluate the performance of the fault detection method under different error scales, a key parameter is set: the error-scale ratio (the number of faulty links divided by the total number of links in the topology). The number of fault repair rounds, the time consumed per round, and the accuracy of the system are tested under different link error scales, i.e., different proportions of faulty links. In the tests, the link error ratio is increased from 10% to 100% in steps of 10%. By controlling the link error scale, the performance of the system in data center networks with different error scales is further studied qualitatively.
(3) Different K values
The Top K algorithm is a very important algorithm in this system: it affects not only the system's efficiency but also, through the choice of K, its accuracy to a large degree. As noted above, the value of K strongly influences both accuracy and time consumption. Therefore, the number of fault repair rounds, the time consumed per round, and the accuracy of the system are observed for different K values, in order to find a suitable K for the data center network.
Qualitative and quantitative analysis of the test results shows that repair becomes more accurate as K increases, but the time consumed per round also grows with K. That is, obtaining a more accurate detection requires sacrificing efficiency, i.e., spending more time on computation. Similarly, the larger the error scale of the data center network, the higher the detection accuracy, but also the longer the detection time. Therefore, according to the characteristics of different data center networks, the value of K must be tuned continuously to obtain an optimal trade-off.
As a preferred embodiment, the performing centralized network fault location by collecting fault path information uploaded by a plurality of servers includes: monitoring a Redis database to obtain changes in link information, and putting the changed link information into target sets, wherein the target sets comprise a disappeared-path set and a repaired-path set; and calculating the probability of faulty links from the data in the disappeared-path set.
In a specific implementation, changes in link information can be obtained by monitoring the Redis database. When a change is observed, the change information is placed into the corresponding set. Considering that packet loss may also be caused by network congestion or poor link state, a path is marked as aged only after it has been missing for a certain time. Two sets are constructed, representing the disappeared paths and the repaired paths respectively. The probability of faulty links is calculated from the data in the disappeared-path set, and repair suggestions are given accordingly. When a link is repaired (not necessarily completely), some paths are added back to the database; these are placed in the repaired-path set and removed from the disappeared-path set. The set of faulty links is then recalculated, and updated link error probabilities and repair suggestions are given. This process repeats until the disappeared-path set is empty, indicating that all paths have been repaired, i.e., the network has returned to normal.
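A controller-side sketch of this monitoring loop; the Redis key name alive_paths and the snapshot-diffing approach are assumptions, since the patent does not fix a schema:

    import redis  # pip install redis

    r = redis.Redis(host="localhost", port=6379, decode_responses=True)

    def monitor_once(prev_alive, disappeared):
        """One polling round: diff the current path snapshot against the last one."""
        alive = set(r.smembers("alive_paths"))   # paths currently reported by servers
        newly_gone = prev_alive - alive          # candidates for the disappeared-path set
        repaired = disappeared & alive           # previously gone, now back
        disappeared = (disappeared | newly_gone) - repaired
        return alive, disappeared, repaired

    # Repeat until the disappeared-path set is empty, i.e. the network is normal again.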
As a preference in this embodiment, the method further includes: calculating the fault paths through a fault tree analysis algorithm; calculating the faulty positions in the links through the fault tree analysis algorithm; analyzing and locating the faulty positions in a centralized manner through the fault tree analysis algorithm over a plurality of fault paths; and optimizing the fault tree analysis using minimal cut sets and the Top K algorithm.
In a specific implementation, fault paths are calculated through a fault tree analysis algorithm. Fault Tree Analysis (FTA) is a top-down method that uses Boolean logic to analyze the occurrence of a top event in terms of the relationships between the top event and the basic events that cause it.
As an example, the state of top event A is determined by paths P1 and P2. P1 fails if link L1, L2 or L3 fails, i.e. P1 = L1 OR L2 OR L3 = L1 + L2 + L3; similarly, P2 = L1 OR L2 OR L4 = L1 + L2 + L4. Event A occurs only when P1 and P2 occur simultaneously, i.e. A = P1 AND P2 = P1 · P2.
In summary:
A = P1 · P2
  = (L1 + L2 + L3)(L1 + L2 + L4)
  = L1L1 + L1L2 + L1L4 + L2L1 + L2L2 + L2L4 + L3L1 + L3L2 + L3L4
  = L1 + L1L2 + L1L3 + L1L4 + L2 + L2L3 + L2L4 + L3L4
The combinations of faults leading to top event A can thus be obtained by taking the logical AND of the successively received fault paths and expanding the result into a sum-of-products expression. The resulting terms can be sorted by failure probability, providing a reference for operation and maintenance personnel.
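A sketch of this AND-and-expand step in Python: each fault path is treated as an OR over its candidate faulty links, cut sets are frozensets (so idempotence L·L = L is automatic), and the function names are assumptions:

    from itertools import product

    def and_expand(paths):
        """paths: list of link-name lists, e.g. [["L1", "L2", "L3"], ["L1", "L2", "L4"]]."""
        cut_sets = {frozenset()}
        for path in paths:
            # Multiply the running sum-of-products by the next (La + Lb + ...) factor.
            cut_sets = {cs | {link} for cs, link in product(cut_sets, path)}
        return cut_sets

    expanded = and_expand([["L1", "L2", "L3"], ["L1", "L2", "L4"]])
    print(sorted(map(sorted, expanded)))
    # Contains {L1}, {L2}, {L1,L2}, {L3,L4}, ... before minimal-cut-set reduction.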
Taking the logical AND of two paths and expanding it is analogous to polynomial multiplication and quickly becomes long. For example, expanding (L1 + L2)(L1 + L3) requires 4 operations and yields L1 + L1L2 + L1L3 + L2L3. If one more such path is received and multiplied in, 8 further operations are required. In general, assuming each path has k hops and n fault paths are ANDed and expanded, the result of the first n-1 multiplications has up to k^(n-1) terms and the n-th path contributes k terms, so the complexity of the logical AND expansion is O(k^n).
The fault tree analysis algorithm is therefore optimized. First, the cut sets are reduced to minimal cut sets. Suppose event A occurs under the condition A = L1 + L1L2 + L1L3 + L1L4 + L2 + L2L3 + L2L4 + L3L4; each of these 8 cut sets alone is sufficient for A to occur, but L1 and L1L2 have an inclusion relationship: if L1 holds, event A occurs regardless of whether L1L2 holds. By the absorption law of Boolean algebra, the cut sets simplify to the minimal cut sets A = L1 + L2 + L3L4. This sacrifices some precision, since part of the fault combinations are discarded, but greatly improves the efficiency of the algorithm. Second, the minimal cut sets with lower probability are removed. Assuming link failures are independent with probability 0.1 each, the occurrence probabilities of L1, L2 and L3L4 differ significantly (L1 fails with probability 0.1, L2 with probability 0.1, while L3L4 fails with probability 0.1 × 0.1 = 0.01). Therefore, the K most probable error combinations can be recommended each time, improving the efficiency of operation and maintenance personnel. This is implemented by traversing all candidate error combinations, selecting the top K, and recommending them to operation and maintenance personnel, i.e., a Top K algorithm is adopted.
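A sketch of these two optimizations: minimal cut sets via the absorption law (drop every cut set that strictly contains another), then Top K ranking by failure probability under the independence assumption stated above; it reuses and_expand from the sketch above:

    import heapq
    from math import prod

    def minimal_cut_sets(cut_sets):
        return {cs for cs in cut_sets
                if not any(other < cs for other in cut_sets)}  # strict-subset test

    def top_k(cut_sets, link_prob, k):
        def probability(cs):
            return prod(link_prob[l] for l in cs)
        return heapq.nlargest(k, cut_sets, key=probability)

    cut_sets = and_expand([["L1", "L2", "L3"], ["L1", "L2", "L4"]])
    link_prob = {"L1": 0.1, "L2": 0.1, "L3": 0.1, "L4": 0.1}
    print(top_k(minimal_cut_sets(cut_sets), link_prob, k=3))
    # {L1} and {L2} (probability 0.1 each) rank ahead of {L3, L4} (0.01).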
Based on the above steps, link error probabilities are calculated from the disappeared paths in the database. Because each link in a data center is used by multiple paths, multiple paths are affected when a link fails. After the remote centralized controller collects the disappeared paths, it calculates and ranks the failure probabilities and gives path repair suggestions. The probability of each error combination is obtained mainly through logical operations on the links in the disappeared-path set.
In addition, the positions of faulty links are calculated through the fault tree analysis algorithm: the specific fault locations are determined by centralized analysis of multiple fault paths. Meanwhile, optimization based on minimal cut sets and the Top K algorithm reduces computation cost and time.
According to an embodiment of the present application, there is also provided a fault detection apparatus for a data center network based on an in-band network telemetry system, as shown in fig. 3, the apparatus including:
an INT detection module 301, configured to generate an INT detection packet based on an in-band network telemetry system, and forward the INT detection packet in the data center network according to a preset detection path;
an analyzing module 302, configured to analyze the INT probe packet through a receiving end in the data center network, and store the INT probe packet in a local database in a form of a path information table, where the path information table at least includes a preset aging time of each path, and the preset aging time is used to detect a network fault;
a network failure detection module 303, configured to detect whether the network failure exists in the server based on the preset aging time in the path information table;
a rerouting module 304, configured to perform rerouting through a source route when detecting that the network failure exists in the server;
and a fault location module 305, configured to perform centralized network fault location by collecting fault path information uploaded by multiple servers.
In the INT detection module 301, INT is a fine-grained network measurement architecture mainly used to collect real-time network link states in the data plane without excessive control-plane participation. In the INT model, a sending end sends special INT probe packets that traverse the whole network; as a probe packet passes along its probe path through devices with INT information collection capability, each device inserts the corresponding INT information into the packet. The INT probe packets are finally collected and processed by a remote centralized controller and used to implement whole-network congestion control, load balancing, network fault detection, and the like.
In the parsing module 302, the INT probe packet is parsed by the receiving end in the data center network and stored in a local database in the form of a path information table. That is, after receiving a probe packet, the receiving end parses the INT information carried in the packet and stores it in the local database in the form of a path information table.
It should be noted that the path information table at least contains a preset aging time for each path, and the preset aging time is used for detecting network faults. For example, the path information table stored in server A contains all feasible paths from server A to all other reachable servers. Each path is represented by the sequence of switch IDs and corresponding ingress and egress port numbers traversed along the way.
In the network fault detection module 303, whether the network fault exists in the server is detected based on the preset aging time in the path information table. That is, network faults, including gray faults, are discovered through path aging.
In a specific implementation, each path in the path information table has an aging time: if no new probe packet arrives within a certain time to refresh a path's state, that path is deleted. Therefore, if a link failure exists in the network, the related probe packets cannot reach the receiving end, the related paths stored on the receiving end age out, and the server thus judges that a fault exists somewhere on the affected paths. It should be noted that when a link is congested, probe packets may fail to reach the receiving end for a short time, and the server may mistakenly conclude that the path has failed. Therefore, in actual operation, the aging time of paths in the path information table can be increased appropriately to filter out transient link congestion. Very severe congestion may not be filtered out even by increasing the aging time, but in that case the link can reasonably be treated as failed and handled accordingly.
In the rerouting module 304, if the network fault is detected in the server, rerouting is performed via source routing, which enables fast reroute.
In the fault location module 305, although fault detection and fast reroute have been implemented in the data plane, fault location is still required in order to repair the fault. Since distributed servers share no full network topology, the fault paths observed by any single server are insufficient to infer a specific fault location. In the embodiment of the present application, a remote centralized controller is introduced, following existing SDN design specifications, to collect the fault path information uploaded by each server and perform centralized network fault location using the fault path information uploaded by a plurality of servers.
In order to further explain the present solution, the present application further provides a specific application example of the method for detecting a failure in a data center network based on an in-band network telemetry system, which specifically includes the following contents:
the system based on the fault detection party of the data center network based on the in-band network telemetry system comprises: the system comprises an in-band whole-network remote measuring system, a network fault detection and fast rerouting system and a network fault reporting and positioning system.
The server periodically sends INT detection packets to check whether all paths between the source end and the destination end are feasible. After passing through each switch, the INT detection packet records the corresponding switch ID, ingress port number and egress port number, and the switch multicasts the detection packet according to certain forwarding rules so as to cover all feasible paths.
Each probe packet receiving-end server stores the feasible paths in a path information table, and each path is accompanied by an aging time. According to the path information table, the server can send data packets along a specified feasible path through Source Routing (SR). When a network failure occurs, since probe packets cannot pass through the failed link, the relevant paths in the path information table will age out, and these paths will no longer be selected for sending traffic. In addition, the aged paths are reported to the remote centralized controller as path failure information.
Since the path failure information on a single server is not sufficient for accurate network failure localization, the controller needs to receive path failure information from all affected servers and perform centralized analysis. More specifically, the controller needs to find the common point between these aged paths until the failure converges to a certain link between two devices.
(a) In-band whole-network telemetry system
1.1 Planning probe paths covering the whole network. In order for INT probe packets to cover the entire network, specific forwarding rules need to be set in the switches according to the specific network topology.
1.2 Generating INT probe packets by servers. In a Fat-tree data center network, the servers attached to the same ToR switch share the same path information, so not every server needs to send INT probe packets; it suffices to designate one server under each ToR switch as the INT probe packet sender.
The INT specification defines the hardware information that INT-capable devices can provide, but for fault detection the probe packet only needs to collect each switch's ID, ingress port number and egress port number. The fault-detection-oriented INT probe packet therefore consists of an Ethernet header and an IP header (with the protocol field set to 0x700), and the collected INT information per hop consists of a switch ID (8 bits), an ingress port number (8 bits) and an egress port number (8 bits).
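A minimal sketch of this simplified per-hop metadata, assuming it is appended hop by hop as a stack of 3-byte records after the IP header (the helper names are hypothetical):

    import struct

    # Hypothetical encoding of the simplified per-hop INT metadata described
    # above: switch ID, ingress port and egress port, each one byte (8 bits).
    HOP_FMT = "!BBB"                       # three unsigned bytes per hop
    HOP_LEN = struct.calcsize(HOP_FMT)     # 3 bytes

    def append_hop(int_stack: bytes, switch_id: int, in_port: int, out_port: int) -> bytes:
        """What a switch conceptually does: push its metadata onto the stack."""
        return int_stack + struct.pack(HOP_FMT, switch_id, in_port, out_port)

    def parse_path(int_stack: bytes):
        """What the receiving server does: recover the hop-by-hop path."""
        return [struct.unpack_from(HOP_FMT, int_stack, off)
                for off in range(0, len(int_stack), HOP_LEN)]

    stack = b""
    stack = append_hop(stack, switch_id=1, in_port=2, out_port=3)
    stack = append_hop(stack, switch_id=7, in_port=1, out_port=4)
    print(parse_path(stack))  # [(1, 2, 3), (7, 1, 4)]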
1.3 Probe packet processing on the switch side. In the standard INT framework, an INT-capable device can provide a wealth of link state information, including switch ID, ingress and egress port numbers, queue depth, forwarding latency, and so on. The invention is mainly aimed at detecting and locating link failures, so the INT probe packet is simplified to occupy as little network bandwidth as possible while still providing the required functionality.
(b) Network fault detection and fast rerouting system
2.1 Path information table. After receiving a probe packet, the INT probe receiving end parses the INT information carried in the packet and stores it in a local database in the form of a path information table.
Network failures, including gray failures, are discovered through path aging. Each path in the path information table has an aging time: if no new probe packet arrives within that time to refresh the state of a path, the path is deleted. Therefore, if a link failure exists in the network, the related probe packets cannot reach the receiving end, the corresponding paths stored on the receiving end age out, and the server thereby determines that a failure exists somewhere along the affected paths. It should be noted that when a link is congested, probe packets may fail to reach the receiving end for a short period, and the server may mistake the path for a failed one. In practice, therefore, the aging time of paths in the path information table can be increased appropriately to filter out transient link congestion. Very severe congestion may not be filtered out even with a longer aging time, but in that case the link can reasonably be treated as equivalent to a failed link and handled as such.
INT probe packets periodically detect the connectivity of the feasible paths in the network, and the server maintains a path information table storing these paths. Each path entry has an aging time; if no new INT probe packet refreshing the entry is received within that time, the entry is deleted. Since a network failure is the main reason an INT probe packet fails to arrive at the receiving end on time, the aging of path information can serve as the basis for judging whether a failure exists in the network.
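A minimal sketch of such a receiver-side table, assuming wall-clock timestamps and a periodic sweep; the aging value and helper names are hypothetical:

    import time

    # Hypothetical sketch of the receiver-side path information table with
    # aging. A path key is the tuple of (switch ID, in port, out port) hops
    # parsed from a probe; the value is the arrival time of the last probe.
    AGING_TIME = 3.0   # seconds; increased in practice to filter congestion

    path_table = {}

    def on_probe(path_key):
        """Refresh (or install) a feasible path when its probe arrives."""
        path_table[path_key] = time.time()

    def expire_paths():
        """Delete aged paths and return them as suspected failure paths."""
        now = time.time()
        aged = [p for p, t in path_table.items() if now - t > AGING_TIME]
        for p in aged:
            del path_table[p]   # aged paths are no longer used for traffic
        return aged             # reported to the centralized controller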
2.2 Fast reroute through source routing. In this system, an SR field is added to ordinary data packets. Source routing (SR) is a mechanism in which the packet sender attaches hop-by-hop routing information to the packet to specify the exact forwarding path; that is, switches forward ordinary packets entirely according to the SR information they read, without relying on conventional routing-table lookups.
The great advantage of forwarding data packets based on the SR mechanism is that the sender can fully specify the forwarding path. After the sending server senses a fault, it can quickly reroute subsequent traffic onto other feasible paths by reselecting a path from the path information table, reducing the loss caused by packets dropped on the failed link.
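A minimal sketch of the SR idea, assuming the SR field is simply the ordered list of egress port numbers and that each switch pops the head of the list (the encoding is hypothetical; the patent does not specify the field layout):

    # Hypothetical sketch of source routing (SR): the sender prepends the
    # full list of egress ports; each switch pops the head of the list and
    # forwards on that port, so no routing-table lookup is needed.

    def build_sr_header(egress_ports):
        """Sender side: encode a live path chosen from the path table."""
        return list(egress_ports)           # e.g. [3, 1, 4, 2]

    def switch_forward(sr_header):
        """Switch side: pop the next hop; the rest travels with the packet."""
        out_port, rest = sr_header[0], sr_header[1:]
        return out_port, rest

    hdr = build_sr_header([3, 1, 4, 2])
    while hdr:
        port, hdr = switch_forward(hdr)
        print("forward on port", port)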
(c) Network fault reporting and locating system
3.1 Remote notification of failures. Although fault detection and fast reroute are implemented in the data plane, fault location is still required to repair the fault. Since the distributed servers do not share a full network topology, the failed paths observed by each server alone are insufficient to infer a specific failure location. The invention introduces a remote centralized controller, following SDN design specifications, to collect the failure path information uploaded by each server and perform centralized network fault location. By listening to the Redis database, the controller learns of changes to link information; each observed change is placed into a collection. To allow for other causes such as network congestion or packet loss due to poor link state, a link is marked as aged only after its paths have been absent for a certain time. Two sets are maintained, representing the set of vanished paths and the set of repaired paths, respectively. The probability of each link being faulty is calculated from the data in the vanished-path set, and a repair suggestion is given accordingly. When a link is repaired (not necessarily completely), some paths are added back to the database; these are placed in the repaired-path set and removed from the vanished-path set. The set of suspect links is then recalculated, and updated link error probabilities and repair suggestions are given. This process is repeated until the vanished-path set is empty, indicating that all paths have been repaired, i.e., the network has returned to normal.
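A minimal sketch of the listening step, assuming each feasible path is stored as a Redis key with a TTL equal to the aging time and that the Redis server is configured with notify-keyspace-events "Ex$" so that expirations and key writes are published; the key naming and set handling are hypothetical:

    import redis

    # Hypothetical sketch of the controller listening for aged/repaired paths
    # via Redis keyspace notifications on database 0.
    r = redis.Redis(host="localhost", port=6379)
    vanished, repaired = set(), set()

    sub = r.pubsub()
    sub.psubscribe("__keyevent@0__:expired",   # a path key aged out
                   "__keyevent@0__:set")       # a path key reappeared

    for msg in sub.listen():
        if msg["type"] != "pmessage":
            continue                           # skip subscribe confirmations
        path_key = msg["data"].decode()
        if msg["channel"].endswith(b"expired"):
            vanished.add(path_key)             # goes into the vanished-path set
            repaired.discard(path_key)
        else:
            repaired.add(path_key)             # goes into the repaired-path set
            vanished.discard(path_key)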
3.2 Calculating fault paths with a fault tree analysis algorithm. Fault tree analysis (FTA) is a top-down method that uses Boolean logic to analyze how basic events combine to cause a top event, based on the relationships between the top event and the basic events.
3.3 Fault tree analysis algorithm optimization. First, reduce the cut sets to minimal cut sets, with the following derivation: suppose event A occurs under the condition A = L1 + L1·L2 + L1·L3 + L1·L4 + L2 + L2·L3 + L2·L4 + L3·L4. Any of these 8 cut sets can cause event A on its own, but L1 and L1·L2 have an inclusion relationship: if L1 holds, event A must occur regardless of whether L1·L2 holds. By the absorption law from discrete mathematics, the cut sets simplify to the minimal cut sets A = L1 + L2 + L3·L4. This sacrifices some calculation accuracy by discarding part of the fault combinations, but greatly improves algorithm efficiency. Second, remove the minimal cut sets with lower probability. Assuming that link failures are mutually independent and each occurs with probability 0.1, the probabilities of L1, L2 and L3·L4 differ significantly (L1 fails with probability 0.1, L2 with probability 0.1, while L3·L4 fails with probability 0.1 × 0.1 = 0.01). Therefore, only the K most probable error paths need to be recommended each time, improving the working efficiency of operations staff. The implementation mainly traverses all error paths, selects the top K, and recommends them to operations staff; that is, a Top-K algorithm is adopted.
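The two optimizations can be illustrated with the worked example above; a minimal sketch, modeling cut sets as frozensets of link names and assuming independent link failures with probability p = 0.1:

    import heapq

    # Cut sets from the worked example: A = L1 + L1L2 + ... + L3L4.
    cut_sets = [frozenset(s) for s in (
        {"L1"}, {"L1", "L2"}, {"L1", "L3"}, {"L1", "L4"},
        {"L2"}, {"L2", "L3"}, {"L2", "L4"}, {"L3", "L4"},
    )]

    def minimal_cut_sets(sets):
        """Absorption law: drop any cut set that is a superset of another."""
        return [s for s in sets
                if not any(o < s for o in sets)]   # strict-subset test

    def top_k(sets, k, p=0.1):
        """Rank cut sets by failure probability (links independent, each
        failing with probability p) and keep the K most likely."""
        prob = lambda s: p ** len(s)
        return heapq.nlargest(k, sets, key=prob)

    mcs = minimal_cut_sets(cut_sets)
    print(sorted(map(set, mcs), key=len))  # [{'L1'}, {'L2'}, {'L3', 'L4'}]
    print(top_k(mcs, k=2))                 # the two single-link cut sets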
The link error probability is calculated from the vanished paths in the database. In a data center each link is used by many paths, so when a link fails, multiple paths are affected. After the remote centralized controller collects the vanished paths, it calculates and ranks their failure probabilities and gives path repair suggestions. The error-path probability is mainly obtained through logic operations over the links in the vanished-path set.
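As a hypothetical illustration only (the text's actual computation is the Boolean fault-tree logic of sections 3.2 and 3.3), a simple counting heuristic shows how the vanished paths can be used to rank suspect links:

    from collections import Counter

    # Hypothetical ranking sketch: a link traversed by more vanished paths
    # is a more likely culprit; simple counting stands in for the Boolean
    # logic, purely for illustration.
    def rank_links(vanished_paths):
        counts = Counter(link for path in vanished_paths for link in path)
        total = len(vanished_paths)
        # score = fraction of vanished paths that traverse the link
        return sorted(((l, c / total) for l, c in counts.items()),
                      key=lambda lc: lc[1], reverse=True)

    vanished = [("L1", "L5"), ("L1", "L6"), ("L1", "L7"), ("L2", "L5")]
    print(rank_links(vanished)[:2])   # [('L1', 0.75), ('L5', 0.5)]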
The location of the faulty link is calculated with the fault tree analysis algorithm: the multiple failed paths are analyzed centrally to locate the specific point of error. Meanwhile, the algorithm is optimized with minimal cut sets and the Top-K algorithm, reducing computation cost and time.
An embodiment of the present application further provides a specific implementation of an electronic device capable of implementing all steps of the method for detecting a failure of a data center network based on an in-band network telemetry system in the foregoing embodiments. The electronic device comprises a processor configured to call a computer program in a memory; when executing the computer program, the processor implements all steps of the method, for example the following steps:
step 100: generating an INT detection packet based on an in-band network telemetry system, and forwarding the INT detection packet in the data center network according to a preset detection path;
step 200: analyzing the INT detection packet through a receiving end in the data center network, and storing the INT detection packet in a local database according to a form of a path information table, wherein the path information table at least comprises preset aging time of each path, and the preset aging time is used for detecting network faults;
step 300: detecting whether the network fault exists in a server or not based on the preset aging time in the path information table;
step 400: if the network fault exists in the server, rerouting is carried out through the source route;
step 500: and the network fault location is carried out in a centralized way by collecting the fault path information uploaded by the plurality of servers.
Embodiments of the present application further provide a computer-readable storage medium capable of implementing all steps in the method for detecting a failure of a data center network based on an in-band network telemetry system in the foregoing embodiments, where the computer-readable storage medium stores thereon a computer program, and when the computer program is executed by a processor, the computer program implements all steps of the method for detecting a failure of a data center network based on an in-band network telemetry system in the foregoing embodiments, for example, when the processor executes the computer program, the processor implements the following steps:
step 100: generating an INT detection packet based on an in-band network telemetry system, and forwarding the INT detection packet in the data center network according to a preset detection path;
step 200: analyzing the INT detection packet through a receiving end in the data center network, and storing the INT detection packet in a local database according to a form of a path information table, wherein the path information table at least comprises preset aging time of each path, and the preset aging time is used for detecting network faults;
step 300: detecting whether the network fault exists in a server or not based on the preset aging time in the path information table;
step 400: if the network fault exists in the server, rerouting is carried out through the source route;
step 500: and the network fault location is carried out in a centralized way by collecting the fault path information uploaded by the plurality of servers.
As can be seen from the above description, the computer-readable storage medium provided in the embodiments of the present application can generate an INT detection packet based on an in-band network telemetry system and forward it in the data center network according to a preset detection path; parse the INT detection packet at a receiving end in the data center network and store it in a local database in the form of a path information table, where the path information table at least includes a preset aging time for each path, the preset aging time being used for detecting network faults; detect, based on the preset aging time in the path information table, whether a network fault exists in a server; if a network fault exists in the server, perform rerouting through source routing; and perform centralized network fault location by collecting the fault path information uploaded by multiple servers. Through the present application, whole-network telemetry is performed based on the topological characteristics of the data center network, traffic is adjusted promptly in response to network faults, and network fault location including gray faults is realized, thereby solving the problem that multiple simultaneous faults in the data center network need to be rapidly detected and located.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the hardware + program class embodiment, since it is substantially similar to the method embodiment, the description is simple, and the relevant points can be referred to the partial description of the method embodiment.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Although the present application provides method steps as described in an embodiment or flowchart, additional or fewer steps may be included based on conventional or non-inventive efforts. The order of steps recited in the embodiments is merely one manner of performing the steps in a multitude of orders and does not represent the only order of execution. When an actual apparatus or client product executes, it may execute sequentially or in parallel (e.g., in the context of parallel processors or multi-threaded processing) according to the embodiments or methods shown in the figures.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a vehicle-mounted human-computer interaction device, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
As will be appreciated by one skilled in the art, embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, embodiments of the present description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects.
The embodiments of this specification may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The described embodiments may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment. In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of an embodiment of the specification. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
The above description is only an example of the present specification, and is not intended to limit the present specification. Various modifications and variations to the embodiments described herein will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the embodiments of the present specification should be included in the scope of the claims of the embodiments of the present specification.

Claims (10)

1. A fault detection method of a data center network based on an in-band network telemetry system is characterized by comprising the following steps:
generating an INT detection packet based on an in-band network telemetry system, and forwarding the INT detection packet in the data center network according to a preset detection path;
analyzing the INT detection packet through a receiving end in the data center network, and storing the INT detection packet in a local database according to a form of a path information table, wherein the path information table at least comprises preset aging time of each path, and the preset aging time is used for detecting network faults;
detecting whether the network fault exists in a server or not based on the preset aging time in the path information table;
if the network fault exists in the server, rerouting is carried out through the source route;
and the network fault location is carried out in a centralized way by collecting the fault path information uploaded by the plurality of servers.
2. The method of claim 1, wherein the rerouting via a source route further comprises:
adding a source routing field into a data packet of the data center network;
determining a path for forwarding the data packet according to the source routing field;
and when the network fault exists in the server, deleting the aged path, updating the path information table, and then restoring the link.
3. The method of claim 2, wherein the detecting whether the network failure exists in a server based on the preset aging time in the path information table comprises:
detecting whether the INT detection packet is received by the receiving end or not based on the preset aging time of the path in the path information table;
if it is detected that the INT detection packet is not received by the receiving end, deleting the path and considering that the network fault exists in the server, wherein the network fault comprises at least one of the following: a link fault in which the probe packet cannot reach the receiving end due to link congestion, and a link fault in which the probe packet cannot reach the receiving end and the path stored on the receiving end is aged.
4. The method according to claim 2, wherein after detecting whether the network failure exists in the server based on the preset aging time in the path information table, the method further comprises:
obtaining a repair evaluation result according to a network fault positioning strategy, wherein the network fault at least comprises a grey fault;
detecting the network fault in the data center according to a preset probability model, a preset error scale and a K value of a preset Top K algorithm;
the preset probability model comprises a first probability model and a second probability model, wherein the first probability model is used for setting different error occurrence probabilities according to link conditions, and the second probability model is unified according to the error occurrence probabilities of all links;
the preset error scale represents the proportion of errors at different scales, the error-scale proportion being the number of faulty links divided by the total number of links in the topology;
and the K value of the preset Top K algorithm is determined according to the number of fault repairing rounds, the time consumption of each round and the accuracy.
5. The method according to claim 1, wherein the message format of the INT probe packet comprises an Ethernet header and an IP header, and the collected information format of the INT probe packet comprises a switch ID, an ingress port number and an egress port number.
6. The method of claim 1, wherein the centralized network fault location by collecting fault path information uploaded by a plurality of the servers comprises:
monitoring a Redis database to obtain the change of link information, and putting the changed link information into a target set, wherein the target set comprises a set of disappearing paths and a set of repairing paths;
and calculating the probability of the wrong link according to the data in the vanished path set.
7. The method of claim 6, further comprising:
calculating the fault path through a fault tree analysis algorithm;
calculating the error position in the link by a fault tree analysis algorithm;
and analyzing and positioning the error positions in a centralized manner through a fault tree analysis algorithm and a plurality of fault paths, and optimizing in the process of the fault tree analysis algorithm according to the minimal cut set and the Top K algorithm.
8. A device for detecting a failure in a data center network based on an in-band network telemetry system, comprising:
the INT detection module is used for generating an INT detection packet based on the in-band network telemetry system and forwarding the INT detection packet in the data center network according to a preset detection path;
the analysis module is used for analyzing the INT detection packet through a receiving end in the data center network and storing the INT detection packet in a local database according to a form of a path information table, wherein the path information table at least comprises preset aging time of each path, and the preset aging time is used for detecting network faults;
a network fault detection module, configured to detect whether the network fault exists in the server based on the preset aging time in the path information table;
a rerouting module, configured to perform rerouting through a source route when detecting that the network fault exists in the server;
and the fault positioning module is used for carrying out centralized network fault positioning by collecting the fault path information uploaded by the plurality of servers.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program performs the steps of the method for fault detection in an in-band network telemetry system based data center network of any one of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method for fault detection in a data centre network based on an in-band network telemetry system of any one of claims 1 to 7.