CN114900426B - Fault positioning method based on active and passive hybrid measurement and related equipment - Google Patents

Fault positioning method based on active and passive hybrid measurement and related equipment

Info

Publication number
CN114900426B
Authority
CN
China
Prior art keywords
fault
module
active
range
passive hybrid
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210466554.5A
Other languages
Chinese (zh)
Other versions
CN114900426A (en
Inventor
李清
肖劲宇
左旭东
赵丹
江勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peng Cheng Laboratory
Original Assignee
Peng Cheng Laboratory
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peng Cheng Laboratory
Priority to CN202210466554.5A
Publication of CN114900426A
Application granted
Publication of CN114900426B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0677 Localisation of faults
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0604 Management of faults, events, alarms or notifications using filtering, e.g. reduction of information by using priority, element types, position or time
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 Network analysis or design
    • H04L41/142 Network analysis or design using statistical or mathematical methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/16 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks using machine learning or artificial intelligence
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/12 Network monitoring probes

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Evolutionary Computation (AREA)
  • Algebra (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Pure & Applied Mathematics (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a fault locating method based on active and passive hybrid measurement and related equipment. The method is applied to a fault locating system based on active and passive hybrid measurement, which comprises a data plane and a control plane. The data plane comprises a large-flow detection module, a feature extraction module and a fault perception module; the control plane comprises a monitoring switch deployment module, a fault range deducing module and a fault accurate positioning module. The monitoring switch deployment module deploys the monitoring switches, the large-flow detection module filters out small-flow data, which is noise for fault identification, and the fault perception module perceives faults. When a fault occurs, the data stored in the data plane registers is input into the fault range deducing module, which outputs a fault range; finally, the fault accurate positioning module sends a small number of detection packets to pinpoint the fault within that range.

Description

Fault positioning method based on active and passive hybrid measurement and related equipment
Technical Field
The present invention relates to the field of network fault location technologies, and in particular, to a fault location method, system, terminal and computer readable storage medium based on active and passive hybrid measurement.
Background
To provide high-quality transport services for a wide range of applications, modern communication networks keep growing rapidly in size and complexity, and typically contain tens of thousands of network devices and links. This growth makes network operation and management harder, particularly fault management, because a large number of devices and links means that faults are frequent. Related reports indicate that, in a data center network, links fail 5 times a day on average and network devices fail 40 times a day on average, and that it often takes tens of hours to go from discovering a fault to locating and resolving it.
The rapid development of network applications such as online games and live video has placed higher demands on quality of service. While the demands on the network keep rising, network failures remain unavoidable. The failure of a single link or a single node affects every service that depends on it. Network failure detection directly affects the normal operation and service quality of the network. Therefore, network failure detection is an important direction in network research and is receiving more and more attention in the cloud computing era.
Examples of prior-art network fault localization schemes include the following.
Active measurement based on probe packets: this scheme sends probe packets to the corresponding devices and considers a fault to have occurred when a device does not respond within a period of time. On the one hand, a large number of periodic probe packets causes high network bandwidth overhead and forwarding pressure on the switches; on the other hand, a device takes a relatively long time to respond after receiving a probe packet, which lengthens the fault localization delay.
Passive measurement based on network telemetry: this scheme uses network telemetry to collect traffic features of the data plane at different levels, such as packet-level telemetry and flow-level telemetry. However, periodic data reporting places a large overhead on the uplink bandwidth, and the controller consumes a large amount of storage space to hold the reported data.
Fault localization based on topological symmetry: this scheme generally exploits the structural properties of the topology and the behavior of equal-cost path protocols, and uses a statistical method, namely hypothesis testing on the traffic distribution across different links, to find the link that has actually failed. However, it only works on regular network topologies with a high degree of symmetry and generalizes poorly.
That is, existing network fault locating schemes suffer from high overhead, low accuracy, poor generality and long localization delay. Accordingly, the prior art still needs to be improved and developed.
Disclosure of Invention
The invention mainly aims to provide a fault locating method, a system, a terminal and a computer readable storage medium based on active and passive hybrid measurement, which aim to solve the prior-art problem that network faults are difficult to locate because of the ever-growing number of devices and links in the Internet.
In order to achieve the above object, the present invention provides a fault locating method based on active-passive hybrid measurement, which is applied to a fault locating system based on active-passive hybrid measurement, the fault locating system based on active-passive hybrid measurement includes: a data plane and a control plane; the data plane comprises a large-flow detection module, a feature extraction module and a fault perception module; the control plane comprises a monitoring switch deployment module, a fault range deducing module and a fault accurate positioning module; the fault positioning method based on the active and passive hybrid measurement comprises the following steps:
the large-flow detection module filters small-flow characteristics belonging to noise by using a learning model based on packet level characteristics, and retains the large-flow characteristics;
the feature extraction module extracts source and destination level features according to a time window for each large-flow feature and stores the source and destination level features in a switch buffer area;
the fault perception module uses a machine learning model to check the flow level characteristics of the large flow characteristics so as to detect whether network faults occur, and if the network faults are detected, a warning data packet is sent to the control plane;
the monitoring switch deployment module obtains network topology from the topology manager, selects the position of a monitoring node to be deployed by using a monitor selection algorithm, and deploys the monitoring switch according to the position of the monitoring node;
the fault range deducing module feeds the source and destination level characteristics back to the classifier based on machine learning so as to output a fault range;
the fault accurate positioning module monitors the fault range by sending an active detection packet with a special mark so as to realize accurate fault positioning.
The fault positioning method based on active and passive hybrid measurement is characterized in that the feature extraction module is used for providing fine-granularity packet level features, medium-granularity flow level features and coarse-granularity source and destination level features.
According to the fault positioning method based on the active and passive hybrid measurement, the monitoring switch deployment module further issues the written P4 program to the monitoring node.
According to the fault positioning method based on the active and passive hybrid measurement, the P4 program comprises codes of a large-flow detection module, a fault sensing module and a feature extraction module.
The fault location method based on active and passive hybrid measurement, wherein the fault range deducing module feeds back the source and destination level characteristics to a classifier based on machine learning so as to output a fault range, and the fault location method further comprises the following steps:
in the process of collecting flow statistical information, a classifier based on machine learning adaptively performs incremental training update.
The fault location method based on active and passive hybrid measurement, wherein the fault range deducing module feeds back source and destination level characteristics to a classifier based on machine learning so as to output a fault range, specifically comprises the following steps:
after receiving the warning data packet sent by the fault perception module, the control plane acquires source and destination level characteristics from all monitoring nodes in the topological structure, forms a characteristic matrix and inputs the characteristic matrix into the fault range deducing module;
and the fault range deducing module outputs the links with the highest fault probability as suspected links, which form a potential fault link set.
The fault positioning method based on active and passive hybrid measurement, wherein the fault accurate positioning module monitors a fault range by sending an active detection packet with a special mark so as to realize accurate fault positioning, specifically comprises the following steps:
after the fault range is positioned, the fault accurate positioning module detects the potential fault link set by using an active detection packet with a special mark;
and determining that a link is a fault link when the fault accurate positioning module does not receive the active detection packet returned for that link.
According to the fault positioning method based on the active and passive hybrid measurement, the number of the active detection packets is equal to the size of the fault range.
The fault locating method based on active and passive hybrid measurement, wherein the packet level features include: TCP source port, TCP destination port, IP header length, service type, number of remaining hops, TCP data offset, TCP congestion window, and packet length;
the traffic level feature and the source destination level feature include: the number of SYN data packets, the number of FIN data packets, the number of retransmission packets, the average value of transmission windows and the maximum interval of data packet arrival.
In addition, to achieve the above object, the present invention further provides a fault locating system based on active-passive hybrid measurement, where the fault locating system based on active-passive hybrid measurement includes:
a data plane and a control plane; the data plane comprises a large-flow detection module, a feature extraction module and a fault perception module; the large-flow detection module is used for filtering small-flow characteristics belonging to noise by using a learning model based on packet level characteristics and reserving the large-flow characteristics;
the feature extraction module is used for extracting source and destination level features according to a time window for each large-flow feature and storing the source and destination level features in a switch buffer area;
the fault perception module is used for checking the flow level characteristics of the large flow characteristics by using a machine learning model so as to detect whether network faults occur, and if the network faults are detected, a warning data packet is sent to the control plane;
the control plane comprises a monitoring switch deployment module, a fault range deducing module and a fault accurate positioning module;
the monitoring switch deployment module is used for obtaining network topology from the topology manager, selecting the position of a monitoring node to be deployed by using a monitor selection algorithm, and deploying the monitoring switch according to the position of the monitoring node;
the fault range deducing module is used for feeding the source and destination level characteristics back to the classifier based on machine learning so as to output a fault range;
the fault accurate positioning module is used for monitoring a fault range by sending an active detection packet with a special mark so as to realize accurate fault positioning.
The fault positioning system based on active and passive hybrid measurement, wherein the fault perception module is a decision-tree machine learning model for judging whether a network fault has occurred.
The fault positioning system based on active and passive hybrid measurement, wherein the classifiers used in the large-flow detection module and the fault perception module support line-rate processing of incoming traffic.
In addition, to achieve the above object, the present invention also provides a terminal, wherein the terminal includes: a memory, a processor, and a fault location program based on active and passive hybrid measurement that is stored in the memory and can run on the processor; when executed by the processor, the fault location program implements the steps of the fault location method based on active and passive hybrid measurement described above.
In addition, in order to achieve the above object, the present invention also provides a computer readable storage medium storing a fault location program based on active-passive hybrid measurement, which when executed by a processor, implements the steps of the fault location method based on active-passive hybrid measurement as described above.
According to the invention, the monitoring switch deployment module is responsible for deploying the monitoring switch, the large-flow detection module filters out small-flow data belonging to noise, the fault perception module perceives faults, if faults occur, the data stored in the data plane register can be input into the fault range deducing module to output the fault range, and finally the fault accurate positioning module sends a small amount of detection packets to accurately position in the fault range, so that the intelligent network fault perception and positioning are universal, lightweight, low in cost, rapid and accurate.
Drawings
FIG. 1 is a schematic representation of the principle of a preferred embodiment of the fault localization system of the present invention based on active-passive hybrid measurements;
FIG. 2 is a schematic diagram of the interaction of the data plane and the control plane in a preferred embodiment of the fault localization system based on active-passive hybrid measurements of the present invention;
FIG. 3 is a flow chart of a preferred embodiment of the fault localization method based on active-passive hybrid measurement of the present invention;
FIG. 4 is a code schematic diagram of a monitor selection algorithm in a preferred embodiment of the fault location method based on active-passive hybrid measurement of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more clear and clear, the present invention will be further described in detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The network fault positioning system provided by the invention, which is based on active and passive hybrid measurement and an intelligent network, realizes a general, lightweight, low-overhead, rapid and accurate intelligent network fault sensing and positioning scheme on general network topologies. To implement this mechanism, the following challenges need to be addressed:
(1) The switch resources are limited: on the one hand, determining the appropriate granularity for passive measurement is a challenge. Storing and reporting fine-grained statistics, such as packet-level and flow-level features, incurs overhead in switch hardware storage and uplink bandwidth. Coarse-grained statistics, such as switch port counters, avoid this measurement overhead but do not provide enough information for accurate fault localization. On the other hand, deploying a machine-learning-based fault sensor in the data plane is also a challenge: a machine learning model deployed on the switch CPU cannot process traffic at line rate, which leads to a long fault localization delay, and a complex model cannot be deployed in the network data plane.
(2) Accurate network fault location: faults at different locations in the network may exhibit the same pattern of traffic changes, so accurate fault localization is difficult to achieve by passive measurement alone.
(3) The small-flow data is noise for fault identification: on the one hand, a small flow contains few packets and lasts only a short time, so its time-varying characteristics are difficult to capture. On the other hand, both the occurrence of a fault and the normal end of a flow cause a change in traffic. For example, consider two time windows w1 and w2: if 100 packets are observed in w1 and all flows end in w2, the number of packets drops sharply, which resembles the signature of a fault.
(4) Deploying the network monitors at suitable locations: the system needs to deploy monitors at appropriate locations in the network and collect data that characterizes the traffic conditions throughout the network. The more monitors there are, the more traffic information is obtained, but the collection overhead increases accordingly. Choosing where to deploy the monitors is therefore also a challenge.
The system of the present invention addresses the above challenges through the following four designs, respectively:
(1) A data reporting scheme triggered by the fault perception module: the invention designs a machine-learning-based fault perception module on the data plane that can perceive the occurrence of a fault at line rate. Data is reported only when the data plane perceives a fault, which greatly reduces the overhead of data reporting and active probing.
(2) An accurate fault positioning scheme based on active and passive hybrid measurement: after passive measurement narrows the fault down to a range, probe packets are sent by active measurement to pinpoint it.
(3) A large-flow detection module based on machine learning: the invention designs a large-flow detection module on the data plane that identifies large flows at line rate and collects statistics only for them, which avoids the influence of small-flow noise. Meanwhile, to combine the strong fitting capacity of a complex neural network model with the light weight of a simple model, knowledge distillation is used to distill the neural network into a decision tree, which is then deployed to the data plane.
(4) A monitoring switch deployment module based on a greedy approach: the invention designs a monitor selection algorithm that achieves high-precision fault positioning while deploying fewer monitoring switches.
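The distillation step in design (3) can be illustrated with a minimal sketch. It assumes a stand-in teacher and the smallest possible student, a depth-1 decision tree (a stump) fitted to the teacher's hard labels. The `teacher` function and its weights, the two features (a retransmission count and a mean window value) and the sample grid are all hypothetical; the source does not specify the network architecture, the tree depth or the training procedure.

```python
import math

def teacher(x):
    """Stand-in for a trained neural network: returns P(fault) for features x.

    x = (retransmissions, mean_window); the weights are hypothetical.
    """
    z = 0.1 * x[0] - 0.02 * x[1] - 1.25
    return 1.0 / (1.0 + math.exp(-z))

def distill_stump(samples, teacher_fn):
    """Fit a depth-1 decision tree to the teacher's predictions.

    Returns (agreement, feature_index, threshold, polarity), where the
    student predicts "fault" when (x[feature] > threshold) == polarity.
    """
    labels = [teacher_fn(x) >= 0.5 for x in samples]  # teacher's hard labels
    best = None
    for feat in range(len(samples[0])):
        for thr in sorted({x[feat] for x in samples}):
            for polarity in (True, False):
                preds = [(x[feat] > thr) == polarity for x in samples]
                agree = sum(p == y for p, y in zip(preds, labels)) / len(samples)
                if best is None or agree > best[0]:
                    best = (agree, feat, thr, polarity)
    return best

# Unlabelled samples on which the student imitates the teacher.
samples = [(r, w) for r in range(0, 50, 5) for w in (10, 20, 30)]
agreement, feature, threshold, polarity = distill_stump(samples, teacher)
```

A single comparison of this kind maps naturally onto a match-action rule, which is one reason a tree is easier to run at line rate than the network it imitates; a practical student would presumably be deeper than one level.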
The fault locating method based on active-passive hybrid measurement according to the preferred embodiment of the present invention is applied to a fault locating system based on active-passive hybrid measurement, as shown in fig. 1 and 2, where the fault locating system based on active-passive hybrid measurement includes: a data plane 10 and a control plane 20; the data plane 10 comprises a large flow detection module 11, a feature extraction module 12 and a fault perception module 13; the control plane 20 includes a monitoring switch deployment module 21, a fault range inference module 22, and a fault pinpoint module 23.
As shown in fig. 2, specifically, the large-flow detection module 11 is configured to filter out small-flow features belonging to noise using a learning model based on packet-level features, and retain the large-flow features; because the small flow belongs to noise in fault location, the large flow detection module 11 needs to be deployed to filter out the small flow characteristics, and only the large flow characteristics are counted.
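The effect of small-flow noise can be shown with a short sketch that reuses the earlier example of 100 packets in window w1 and none in w2. Two things here are assumptions made only for illustration: the naive drop detector `looks_like_fault`, and `is_large`, which uses the total packet count as a stand-in for the learned packet-level classifier.

```python
def packets_per_window(flows, n_windows, keep=lambda flow: True):
    """Sum per-window packet counts over the flows accepted by `keep`."""
    counts = [0] * n_windows
    for flow in flows:
        if keep(flow):
            for w in range(n_windows):
                counts[w] += flow[w]
    return counts

def looks_like_fault(counts, drop_ratio=0.5):
    """Naive detector: alarm when traffic falls below drop_ratio of the previous window."""
    return any(cur < prev * drop_ratio
               for prev, cur in zip(counts, counts[1:]) if prev > 0)

def is_large(flow, min_packets=50):
    # Hypothetical stand-in for the learned large-flow classifier.
    return sum(flow) >= min_packets

# Ten short flows send 10 packets each in w1 and end normally before w2;
# one long-lived flow keeps sending 40 packets per window.
small_flows = [[10, 0] for _ in range(10)]
large_flows = [[40, 40]]
flows = small_flows + large_flows

all_counts = packets_per_window(flows, 2)               # every flow counted
large_counts = packets_per_window(flows, 2, is_large)   # large flows only

false_alarm = looks_like_fault(all_counts)      # short flows ending look like a fault
filtered_alarm = looks_like_fault(large_counts)  # no alarm once they are filtered
```

The module described above classifies flows from packet-level features rather than from their final size, which is not known while a flow is still in progress; the size threshold here only makes the example self-contained.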
As shown in fig. 2, specifically, the feature extraction module 12 is configured to extract, for each large flow, source-destination-level features according to a time window, and store them in a switch buffer; in online mode, the feature extraction module 12 converts unstructured traffic statistics into structured data using the designed feature engineering algorithms, where the statistical features include packet-level features, flow-level features, and source-destination-level features. That is, the feature extraction module 12 provides fine-grained packet-level features, medium-grained flow-level features and coarse-grained source-destination-level features.
As shown in fig. 2, specifically, the fault perception module 13 is configured to use a machine learning model to check the flow-level features of the large flows to detect whether a network fault has occurred, and if a network fault is detected, send a warning packet to the control plane. The fault perception module 13 is a decision-tree machine learning model for determining whether a network fault has occurred, and can perceive faults at line rate. The classifiers used in the large-flow detection module 11 and the fault perception module 13 both support line-rate processing of incoming traffic.
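As a rough illustration of window-based extraction, the sketch below groups packets by time window and source-destination pair and computes the five statistics listed earlier for the flow-level and source-destination-level features (SYN count, FIN count, retransmission count, mean window, maximum inter-arrival gap). The `Packet` record, the function name and the rule that a repeated sequence number counts as a retransmission are assumptions for illustration; a real data plane would keep these counters in switch registers rather than in Python dictionaries.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Packet:
    ts: float      # arrival time in seconds
    src: str
    dst: str
    syn: bool
    fin: bool
    seq: int       # TCP sequence number
    window: int    # advertised TCP window

def window_features(packets, window_len):
    """Return {(window_index, src, dst): feature dict} for the given packets."""
    groups = defaultdict(list)
    for p in sorted(packets, key=lambda p: p.ts):
        groups[(int(p.ts // window_len), p.src, p.dst)].append(p)

    features = {}
    for key, pkts in groups.items():
        seen, retrans = set(), 0
        for p in pkts:
            if p.seq in seen:          # simplification: repeated seq = retransmission
                retrans += 1
            seen.add(p.seq)
        gaps = [b.ts - a.ts for a, b in zip(pkts, pkts[1:])]
        features[key] = {
            "syn": sum(p.syn for p in pkts),
            "fin": sum(p.fin for p in pkts),
            "retrans": retrans,
            "mean_window": sum(p.window for p in pkts) / len(pkts),
            "max_gap": max(gaps, default=0.0),
        }
    return features

packets = [
    Packet(0.0, "A", "B", True, False, 1, 100),
    Packet(0.1, "A", "B", False, False, 2, 200),
    Packet(0.5, "A", "B", False, False, 2, 300),   # same seq again
    Packet(1.2, "A", "B", False, True, 3, 100),
]
feats = window_features(packets, window_len=1.0)
```

Keying on the source-destination pair rather than on the full flow identifier is what makes the reported record coarse: many flows between the same endpoints collapse into one entry per window.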
As shown in fig. 2, specifically, the monitoring switch deployment module 21 is configured to obtain a network topology from a topology manager, select the locations where monitoring nodes need to be deployed using a monitor selection algorithm, and deploy monitoring switches (e.g., the monitoring switch in fig. 2) at those locations. In the offline mode, the monitoring switch deployment module 21 obtains the network topology from the topology manager of the controller, uses the monitor selection algorithm to select the locations where monitoring nodes need to be deployed, and then issues the written P4 (Programming Protocol-independent Packet Processors, a protocol-independent packet processing programming language) program, which includes the code of the feature extraction module 12, the fault perception module 13 and the large-flow detection module 11, to the monitoring nodes; the goal is to deploy fewer monitoring switches.
As shown in fig. 2, specifically, the fault range inference module 22 is configured to feed the source-destination-level features into the machine-learning-based classifier to output a fault range; the fault range inference module 22 inputs the traffic data into a machine-learning-based classification model (classifier), performs fault detection within microseconds, and outputs a range of faulty devices; while traffic statistics are being collected, the classifier is adaptively updated by incremental training.
As shown in fig. 2, specifically, after the fault range deducing module 22 determines the fault range, the fault accurate positioning module 23 is configured to monitor the fault range by sending an active probe packet with a special mark, so as to implement accurate fault positioning. The monitoring switch deployment module 21 is responsible for deploying the monitoring switch, the large-flow detection module 11 filters out small-flow data belonging to noise, the fault sensing module 13 senses a fault, if the fault occurs, the data stored in the register of the data plane 10 can be input into the fault range deducing module 22 to output a fault range, and finally the fault accurate positioning module 23 sends a small amount of active detection packets to accurately position in the fault range. The components cooperate with each other, so that the cost of periodic reporting of passive measurement data and periodic detection packet of active measurement is reduced, and the fault location with universality, low delay and high precision is realized.
Further, based on the fault locating system based on the active-passive hybrid measurement, the invention also provides a fault locating method based on the active-passive hybrid measurement. As shown in fig. 3, the fault locating method based on the active-passive hybrid measurement comprises the following steps:
step S10, the large-flow detection module filters small-flow characteristics belonging to noise by using a learning model based on packet level characteristics, and retains the large-flow characteristics;
step S20, the feature extraction module extracts source destination level features according to a time window for each large-flow feature and stores the source destination level features in a switch buffer area;
step S30, the fault sensing module uses a machine learning model to check the flow level characteristics of the large flow characteristics so as to detect whether network faults occur, and if the network faults are detected, a warning data packet is sent to the control plane;
step S40, the monitoring switch deployment module obtains network topology from a topology manager, selects the position of a monitoring node to be deployed by using a monitor selection algorithm, and deploys the monitoring switch according to the position of the monitoring node;
step S50, the fault range deducing module feeds the source and destination level characteristics back to a classifier based on machine learning so as to output a fault range;
and step S60, the fault accurate positioning module monitors the fault range by sending an active detection packet with a special mark so as to realize accurate fault positioning.
In the data plane, the large-flow detection module first filters out small-flow features by using a learning model based on packet-level features. Then, for each large flow, the feature extraction module extracts source-destination-level features according to a time window and stores them in a switch buffer. Next, the fault perception module checks the flow-level features of the flow with a machine learning model to determine whether a network fault has occurred; if a network fault is detected, a warning data packet is sent to the control plane. After receiving the warning data packet from the data plane, the controller adopts a two-stage positioning strategy to locate the faulty link efficiently. First, the fault range inference module feeds the source-destination-level features into a machine-learning-based classifier (classification model), which outputs a range of faulty devices. Then, to determine the exact fault location, the fault accurate positioning module probes each suspected faulty link with a specially marked active probe packet; a link for which no returned probe packet is received is identified as the faulty link. The two-stage positioning strategy greatly reduces the number of probe packets sent for fault localization, and thereby reduces the burden on bandwidth resources.
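The two-stage strategy can be sketched as follows. This is a minimal illustration under stated assumptions: `link_probs` stands in for the output of the machine-learning-based classifier, `send_probe` is a hypothetical callback that returns whether the marked probe came back, and the function names are not taken from the source.

```python
def infer_fault_range(link_probs, size):
    """Stage 1: keep the `size` links with the highest fault probability."""
    ranked = sorted(link_probs, key=link_probs.get, reverse=True)
    return ranked[:size]

def pinpoint(suspects, send_probe):
    """Stage 2: probe each suspected link once; an unanswered probe marks a fault.

    Returns (faulty_links, probes_sent).
    """
    faulty, probes_sent = [], 0
    for link in suspects:
        probes_sent += 1
        if not send_probe(link):   # no returned probe packet
            faulty.append(link)
    return faulty, probes_sent

# Hypothetical classifier output for a three-link topology.
link_probs = {("s1", "s2"): 0.1, ("s2", "s3"): 0.7, ("s3", "s4"): 0.2}
broken = {("s2", "s3")}                    # ground truth used only by the fake probe

suspects = infer_fault_range(link_probs, size=2)
faulty, probes_sent = pinpoint(suspects, lambda link: link not in broken)
```

One probe is sent per suspected link, so the number of probes equals the size of the fault range rather than the number of links in the network, which is where the bandwidth saving comes from.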
In particular, to achieve efficient traffic monitoring, the number of monitors (monitoring switches) and their location in the network need to be carefully considered. On the one hand, the greater the number of monitors, the more traffic information can be captured to facilitate more accurate fault localization. On the other hand, an increase in the number of monitors not only creates additional deployment costs, but also increases data collection time, bandwidth overhead, and controller pressure. Therefore, a trade-off must be made between the level of detail in collecting information and the number of monitors.
The monitoring switch deployment module on the control plane selects monitors using a monitor selection algorithm whose code is shown in fig. 4. A monitor can only observe the end-to-end traffic information of source-destination pairs whose shortest path passes through it. The algorithm first builds the shortest path set P for all traffic using the Floyd algorithm. Then, to cover as much traffic as possible, the algorithm selects monitors iteratively: in each iteration, the node that provides the greatest information gain SCORE among all nodes is selected as a monitor. The information gain of node i is defined as:
SCORE(i) = |S_i - (S_i ∩ C)|;
where S_i represents the set of shortest paths through node i, and C represents the set of shortest paths already covered by the monitors selected in previous iterations. After a monitor is selected, the paths it covers are removed from the shortest path set P. The algorithm ends when all paths are covered, or when the number of monitors reaches the required number K. The network administrator may also implement other algorithms, or use a full deployment (all switches are monitors) directly.
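The greedy selection described above can be sketched in Python as follows. This is only an illustrative sketch, not the code of fig. 4: the function names, the unit link weights, and the tie-breaking rule (first node in list order) are assumptions made for the example.

```python
from itertools import product

INF = float("inf")


def floyd_shortest_paths(nodes, edges):
    """All-pairs shortest paths (Floyd-Warshall) on an undirected, unit-weight graph.

    Returns {(src, dst): [src, ..., dst]} for every reachable ordered pair.
    """
    dist = {(u, v): (0 if u == v else INF) for u, v in product(nodes, nodes)}
    nxt = {}  # nxt[u, v] = next hop on the shortest path from u to v
    for u, v in edges:
        dist[u, v] = dist[v, u] = 1
        nxt[u, v], nxt[v, u] = v, u
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if dist[i, k] + dist[k, j] < dist[i, j]:
                    dist[i, j] = dist[i, k] + dist[k, j]
                    nxt[i, j] = nxt[i, k]
    paths = {}
    for s, d in product(nodes, nodes):
        if s != d and dist[s, d] < INF:
            path, cur = [s], s
            while cur != d:
                cur = nxt[cur, d]
                path.append(cur)
            paths[s, d] = path
    return paths


def select_monitors(nodes, edges, k):
    """Greedily pick at most k monitors, maximizing SCORE(i) = |S_i - (S_i ∩ C)| each round."""
    paths = floyd_shortest_paths(nodes, edges)
    # S_i: the source-destination pairs whose shortest path passes through node i
    through = {n: {pair for pair, p in paths.items() if n in p} for n in nodes}
    covered = set()  # C: pairs already covered by selected monitors
    monitors = []
    while len(monitors) < k and len(covered) < len(paths):
        best = max(nodes, key=lambda n: len(through[n] - covered))
        if not through[best] - covered:
            break  # no node adds new coverage
        monitors.append(best)
        covered |= through[best]
    return monitors
```

On a star topology, for example, the hub lies on every shortest path, so a single monitor at the hub covers all traffic and the loop stops before K is reached.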
The invention adopts a dynamic deployment strategy. The control plane first loads the P4 programs onto all switches offline. At this point, the programs are disabled so that they do not consume switch resources. The monitor selector in the controller then dynamically selects the monitors, sends the configuration file (including the P4 program, time window, and other parameters) to the data plane, and turns on the fault location function of the selected monitors. This dynamic deployment policy ensures that monitoring is activated on a switch only when traffic passes through it, so resource consumption during idle periods is avoided.
The feature extraction module processes traffic and extracts features at several granularities. In the data plane, it provides fine-grained packet-level features for the large flow detection module and medium-grained flow-level features for the fault perception module. It also generates coarse-grained source-destination level features, which are reported to the control plane when a fault is detected. Reporting source-destination level features significantly reduces communication overhead compared with reporting flow-level or packet-level features.
Table 1 lists the packet-level, flow-level, and source-destination level features. The packet-level features include: TCP source port, TCP destination port, IP header length, service type, number of remaining hops, TCP data offset, TCP congestion window, and packet length. The flow-level and source-destination level features include: the number of SYN packets, the number of FIN packets, the number of retransmitted packets, the average sending window, and the maximum packet arrival interval. The flow-level and source-destination level features are extracted over the span of a time window; to keep the switches consistent when extracting features, the system performs network-wide time synchronization at initialization.
Figure GDA0004196563470000061
Table 1: description of the features
A sudden decrease in the number of packets is observed both when a fault occurs and when traffic ends normally. However, unlike a normal end of traffic, a fault also causes other changes in traffic patterns, such as an increase in the number of TCP retransmitted packets, a decrease in the sending window, and longer packet arrival intervals. Thus, to distinguish faults from normal traffic ends, features related to SYN and FIN packets are added to the flow-level and source-destination level features. The fault accurate location module considers the following features (Table 1 describes all the features): SYN_count (denoted S), FIN_count (denoted F), packet_count (denoted P), ret_count (denoted R), cwnd_mean (denoted C), gap_max (denoted G). Note that all packets, whether they belong to large or small flows, are counted when computing the flow-level features, whereas only large-flow packets are counted when computing the source-destination level features.
The three levels of flow characteristics are constructed as follows:
packet level features: the feature extractor extracts the data packet header information and constructs a packet-level feature vector X, specifically as follows:
X=[srcPort,dstPort,ip_ihl,ip_ToS,ip_TTL,TCP_dataofs,TCP_window,length];
flow level features: within a time window, the flow-level feature matrix of the flow identified by its five-tuple is constructed as:
Figure GDA0004196563470000071
where S_w represents SYN_count in the w-th window, and similarly for F_w, P_w, R_w, C_w, and G_w.
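As an illustration of how the six per-window values S, F, P, R, C, and G could be aggregated for each five-tuple, here is a minimal Python sketch. The packet dictionary layout (keys `ts`, `five_tuple`, `syn`, `fin`, `retransmit`, `window`) and the function name are assumptions made for the example; in the described system this aggregation runs in the switch data plane, not in Python, and how retransmissions are detected is not modeled here.

```python
from collections import defaultdict


def flow_level_features(packets, window):
    """Aggregate packets into rows [S, F, P, R, C, G] keyed by (five_tuple, window index)."""
    groups = defaultdict(list)
    for pkt in packets:
        w = int(pkt["ts"] // window)  # index of the time window the packet falls in
        groups[pkt["five_tuple"], w].append(pkt)

    features = {}
    for key, pkts in groups.items():
        pkts.sort(key=lambda p: p["ts"])
        gaps = [b["ts"] - a["ts"] for a, b in zip(pkts, pkts[1:])]
        features[key] = [
            sum(p["syn"] for p in pkts),                 # S: SYN_count
            sum(p["fin"] for p in pkts),                 # F: FIN_count
            len(pkts),                                   # P: packet_count
            sum(p["retransmit"] for p in pkts),          # R: ret_count
            sum(p["window"] for p in pkts) / len(pkts),  # C: mean sending window
            max(gaps, default=0.0),                      # G: largest arrival gap
        ]
    return features
```

A window that contains a single packet has no arrival gap, so G defaults to 0.0 in this sketch.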
Source destination level feature: in a time window, the monitoring node only considers the large-flow data packet, collects the flow characteristics of all source and destination pairs, and constructs a characteristic matrix as follows:
Figure GDA0004196563470000072
where the superscript pair i, j identifies the source destination pair from node i to node j, and V is the number of topology nodes.
For each monitor, its source-destination level feature M over W time windows is obtained by concatenating the matrices T in chronological order:
M = [T_1, …, T_w, …, T_W];
The source-destination level feature V of all monitors over W time windows is the concatenation of the features M of all monitors in the topology:
V = [M_1, …, M_k, …, M_K];
where K represents the number of monitors.
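The two concatenation steps can be sketched as below. Because the layout of T is only given in a figure, the sketch assumes that each T is a list of per-pair rows and that concatenation means flattening in order; the function names are likewise hypothetical.

```python
def concat_windows(T_windows):
    """M = [T_1, ..., T_W]: flatten each window's matrix and join them in time order."""
    return [value for T in T_windows for row in T for value in row]


def concat_monitors(M_list):
    """V = [M_1, ..., M_K]: join the per-monitor vectors in monitor order."""
    return [value for M in M_list for value in M]
```

Under these assumptions, the classifier input has K x W x (number of source-destination pairs) x 6 entries, which makes the effect of K and W on the reporting cost explicit.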
The fault sensing module uses the flow level characteristics collected by the characteristic extraction module in the past time window as input to identify whether a fault occurs, and when the fault sensing module identifies the fault, the fault sensing module sends a warning data packet to the control plane, so that the controller can further collect the characteristics and accurately position the fault.
The large flow detection module filters out small flows so that the end of such flows does not interfere with fault detection. It takes the packet-level feature vector X as input and identifies whether a flow is large or small. If the classification result indicates that a data packet belongs to a large flow, the feature extraction module processes the packet to update the source-destination level features in the ring buffer (switch buffer) of the switch.
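The filter-then-update step can be pictured with the following Python sketch. It is a simplification under stated assumptions: the class and function names are invented for the example, only three of the counters are tracked, and `size_rule` is a hypothetical stand-in for the learned packet-level classifier, which the text does not specify in detail.

```python
from collections import deque


class SourceDestRingBuffer:
    """Keeps per-(src, dst) counters for the most recent `num_windows` time windows."""

    def __init__(self, num_windows):
        self.windows = deque(maxlen=num_windows)  # oldest window is evicted automatically
        self.current = {}

    def update(self, pkt, is_large_flow):
        """Count the packet only if the classifier says it belongs to a large flow."""
        if not is_large_flow(pkt):
            return False
        key = (pkt["src"], pkt["dst"])
        row = self.current.setdefault(key, {"S": 0, "F": 0, "P": 0})
        row["S"] += pkt.get("syn", 0)
        row["F"] += pkt.get("fin", 0)
        row["P"] += 1
        return True

    def close_window(self):
        """Finish the current time window and start a new, empty one."""
        self.windows.append(self.current)
        self.current = {}


def size_rule(pkt, threshold=1000):
    """Hypothetical stand-in classifier: treat long packets as large-flow packets."""
    return pkt["length"] >= threshold
```

Using a bounded `deque` mirrors the ring-buffer behavior: once the configured number of windows is stored, closing a new window overwrites the oldest one.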
The classifiers used in the large flow detection module and the fault awareness module must support line-speed processing of incoming traffic, which can only be achieved through in-network computation on programmable switches. Powerful classifiers such as neural networks can provide high accuracy, but deploying them on programmable switches is not feasible because programmable switches support only a limited set of operations. After receiving the warning data packet from a switch, the control plane first acquires the source-destination level features from all monitoring nodes in the topology, forms a feature matrix, and inputs it into the fault range deducing module. Some link faults cause similar traffic change patterns and are difficult to distinguish correctly. Therefore, the fault range inference module outputs the links with the highest probability as suspected links, forming a potential faulty link set. The invention adopts XGBoost, a tree-based classification model, for fault range localization, for the following reasons:
(1) Light weight: as a tree-based algorithm, XGBoost has a small model size and low computational cost, so it can perform fault localization in a negligible amount of time.
(2) Good performance: XGBoost can represent very complex decision strategies, and its performance is superior to that of other models.
(3) Strong interpretability: the tree-based model XGBoost has strong verifiability and robustness, and its strong interpretability helps network administrators understand the results.
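The step of turning classifier output into a potential faulty link set can be sketched as follows. The sketch does not call XGBoost; it assumes the per-link fault probabilities have already been produced by the classifier and are passed in as a dictionary. The function name and the parameter k (the size of the fault range) are assumptions made for the example.

```python
import heapq


def suspect_links(link_probs, k):
    """Return the k links with the highest predicted fault probability, best first."""
    top = heapq.nlargest(k, link_probs.items(), key=lambda item: item[1])
    return [link for link, _ in top]
```

A larger k makes it more likely that the truly faulty link is inside the set when several faults look alike, at the cost of more active detection packets in the second stage.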
After the fault range is located, the system needs to check whether the devices in the fault range are actually faulty. The fault accurate locating module deployed in the controller (control plane) simultaneously sends specially marked active detection packets (detection data packets) to probe the suspicious devices. Each detection data packet starts from the controller, traverses a possibly faulty device in the set, and finally returns to the controller. If the controller does not receive the returned packet, the corresponding link is determined to be faulty. The number of detection data packets equals the size of the fault range.
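The second-stage probing logic can be sketched as below. The `send_probe` callable is a hypothetical placeholder for sending a specially marked detection packet through one suspected link and waiting for its return; how the packet is marked and routed is not modeled here, and probes are issued sequentially in the sketch for simplicity.

```python
def locate_faulty_links(suspects, send_probe):
    """Probe every suspected link once; links whose probe never returns are faulty.

    Returns (faulty_links, probes_sent). probes_sent always equals len(suspects).
    """
    faulty = []
    probes_sent = 0
    for link in suspects:
        probes_sent += 1
        if not send_probe(link):  # probe did not come back to the controller
            faulty.append(link)
    return faulty, probes_sent
```

For example, with suspects `["B-C", "C-D"]` and a network in which only `B-C` is broken, the function reports `["B-C"]` after sending two probes, one per suspected link.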
Furthermore, based on the fault locating method and system based on the active-passive hybrid measurement, the invention further correspondingly provides a terminal, and the terminal comprises a processor, a memory and a display.
The memory may in some embodiments be an internal storage unit of the terminal, such as a hard disk or a memory of the terminal. The memory may in other embodiments also be an external storage device of the terminal, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like provided on the terminal. Further, the memory may also include both an internal storage unit and an external storage device of the terminal. The memory is used for storing application software installed on the terminal and various data, such as the program code installed on the terminal. The memory may also be used to temporarily store data that has been output or is to be output. In an embodiment, a fault locating program based on the active-passive hybrid measurement is stored in the memory, and the fault locating program can be executed by the processor, so that the fault locating method based on the active-passive hybrid measurement is realized.
The processor may in some embodiments be a central processing unit (Central Processing Unit, CPU), microprocessor or other data processing chip for running program code or processing data stored in the memory, for example performing the active-passive hybrid measurement based fault localization method or the like.
The display may in some embodiments be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch display, or the like. The display is used for displaying information on the terminal and displaying a visual user interface. The components of the terminal communicate with each other via a system bus.
In an embodiment, the step of fault localization based on active-passive hybrid measurements as described above is implemented when the processor executes a fault localization program based on active-passive hybrid measurements in said memory.
Further, the present invention also provides a computer readable storage medium, wherein the computer readable storage medium stores a fault location program based on active-passive hybrid measurement, and the fault location program based on active-passive hybrid measurement implements the steps of the fault location method based on active-passive hybrid measurement as described above when executed by a processor.
In summary, the technical scheme of the invention has the following beneficial effects:
(1) A data reporting scheme triggered by the fault sensing module is designed: faults do not occur frequently in a network, so continuous data reporting and probing are unnecessary. Reporting data only when a fault is perceived in the data plane greatly reduces the cost of data reporting and active probing. The scheme therefore places a machine-learning-based fault perception module in the data plane, so that faults can be perceived at line speed.
(2) An accurate fault positioning scheme based on active-passive hybrid measurement is designed: collecting source-destination level features costs less than collecting flow-level and packet-level features. After passive measurement narrows the fault down to a range, active detection packets are sent to achieve accurate positioning. In addition, to combine the strong fitting capacity of a complex neural network model with the light weight of a simple model, knowledge distillation is used to distill the neural network into a decision tree, which is deployed to the data plane.
(3) A monitoring switch deployment module based on a greedy approach is designed: to balance the number of monitoring nodes against the traffic information captured, the monitor selection algorithm is designed so that high-precision fault positioning is achieved with fewer monitoring nodes deployed.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article or terminal comprising the element.
It is to be understood that the invention is not limited in its application to the examples described above, but is capable of modification and variation in light of the above teachings by those skilled in the art, and that all such modifications and variations are intended to be included within the scope of the appended claims.

Claims (12)

1. The fault locating method based on the active-passive hybrid measurement is characterized by being applied to a fault locating system based on the active-passive hybrid measurement, and the fault locating system based on the active-passive hybrid measurement comprises the following steps: a data plane and a control plane; the data plane comprises a large-flow detection module, a feature extraction module and a fault perception module; the control plane comprises a monitoring switch deployment module, a fault range deducing module and a fault accurate positioning module; the fault locating method based on the active and passive hybrid measurement comprises the following steps:
the large-flow detection module filters small-flow characteristics belonging to noise by using a learning model based on packet level characteristics, and retains the large-flow characteristics;
the feature extraction module extracts source and destination level features according to a time window for each large-flow feature and stores the source and destination level features in a switch buffer area;
the fault perception module uses a machine learning model to check the flow level characteristics of the large flow characteristics so as to detect whether network faults occur, and if the network faults are detected, a warning data packet is sent to the control plane;
the monitoring switch deployment module obtains network topology from the topology manager, selects the position of a monitoring node to be deployed by using a monitor selection algorithm, and deploys the monitoring switch according to the position of the monitoring node;
the fault range deducing module feeds the source and destination level characteristics back to the classifier based on machine learning so as to output a fault range;
the fault range inference module feeds back the source and destination level characteristics to the classifier based on machine learning so as to output a fault range, and specifically comprises the following steps:
after receiving the warning data packet sent by the fault perception module, the control plane acquires source and destination level characteristics from all monitoring nodes in the topological structure, forms a characteristic matrix and inputs the characteristic matrix into the fault range deducing module;
the fault range deducing module outputs a link with highest probability as a suspected link to form a potential fault link set;
the fault accurate positioning module monitors a fault range by sending an active detection packet with a special mark so as to realize accurate fault positioning;
the fault accurate positioning module monitors a fault range by sending an active detection packet with a special mark so as to realize accurate fault positioning, and specifically comprises the following steps:
after the fault range is positioned, the fault accurate positioning module detects the potential fault link set by using an active detection packet with a special mark;
and for the current link of which the fault accurate positioning module does not receive the returned active detection packet, determining that the current link is a fault link.
2. The method of claim 1, wherein the feature extraction module is configured to provide a fine-grained packet-level feature, a medium-grained traffic-level feature, and a coarse-grained source-destination-level feature.
3. The method for fault location based on active-passive hybrid measurement of claim 1, wherein the monitoring switch deployment module further issues a written P4 program to a monitoring node.
4. The method for fault location based on active-passive hybrid measurement of claim 3 wherein the P4 program comprises code of a large flow detection module, a fault perception module and a feature extraction module.
5. The method of claim 1, wherein the fault scope inference module feeds back source destination level characteristics to a machine learning based classifier to output a fault scope, further comprising:
in the process of collecting flow statistical information, a classifier based on machine learning adaptively performs incremental training update.
6. The method for fault location based on active-passive hybrid measurement of claim 1, wherein the number of active probe packets is equal to the size of the fault range.
7. The active-passive hybrid measurement based fault location method of claim 2, wherein the packet level features comprise: TCP source port, TCP destination port, IP header length, service type, number of remaining hops, TCP data offset, TCP congestion window, and packet length;
the traffic level feature and the source destination level feature include: the number of SYN data packets, the number of FIN data packets, the number of retransmission packets, the average value of transmission windows and the maximum interval of data packet arrival.
8. A fault location system based on active-passive hybrid measurements, the system comprising: a data plane and a control plane; the data plane comprises a large-flow detection module, a feature extraction module and a fault perception module;
the large-flow detection module is used for filtering small-flow characteristics belonging to noise by using a learning model based on packet level characteristics and reserving the large-flow characteristics;
the feature extraction module is used for extracting source and destination level features according to a time window for each large-flow feature and storing the source and destination level features in a switch buffer area;
the fault perception module is used for checking the flow level characteristics of the large flow characteristics by using a machine learning model so as to detect whether network faults occur, and if the network faults are detected, a warning data packet is sent to the control plane;
the control plane comprises a monitoring switch deployment module, a fault range deducing module and a fault accurate positioning module;
the monitoring switch deployment module is used for obtaining network topology from the topology manager, selecting the position of a monitoring node to be deployed by using a monitor selection algorithm, and deploying the monitoring switch according to the position of the monitoring node;
the fault range deducing module is used for feeding the source and destination level characteristics back to the classifier based on machine learning so as to output a fault range;
the fault range deducing module is used for feeding back the source and destination level characteristics to the classifier based on machine learning so as to output a fault range, and specifically comprises the following steps:
after receiving the warning data packet sent by the fault perception module, the control plane acquires source and destination level characteristics from all monitoring nodes in the topological structure, forms a characteristic matrix and inputs the characteristic matrix into the fault range deducing module;
the fault range deducing module outputs a link with highest probability as a suspected link to form a potential fault link set;
the fault accurate positioning module is used for monitoring a fault range by sending an active detection packet with a special mark so as to realize accurate fault positioning;
the fault accurate positioning module is used for monitoring a fault range by sending an active detection packet with a special mark so as to realize accurate fault positioning, and specifically comprises the following steps:
after the fault range is positioned, the fault accurate positioning module detects the potential fault link set by using an active detection packet with a special mark;
and for the current link of which the fault accurate positioning module does not receive the returned active detection packet, determining that the current link is a fault link.
9. The active-passive hybrid measurement based fault location system of claim 8, wherein the fault awareness module is a machine learning model decision tree for determining whether a network fault has occurred.
10. The active-passive hybrid measurement based fault location system of claim 8, wherein classifiers used in the large flow detection module and the fault awareness module support line speed processing of incoming traffic.
11. A terminal, the terminal comprising: memory, a processor and an active-passive hybrid measurement based fault location program stored on the memory and executable on the processor, which when executed by the processor implements the steps of the active-passive hybrid measurement based fault location method as claimed in any one of claims 1 to 7.
12. A computer readable storage medium, characterized in that it stores a fault localization program based on active-passive hybrid measurements, which when executed by a processor implements the steps of the active-passive hybrid measurement based fault localization method according to any of claims 1-7.
CN202210466554.5A 2022-04-29 2022-04-29 Fault positioning method based on active and passive hybrid measurement and related equipment Active CN114900426B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210466554.5A CN114900426B (en) 2022-04-29 2022-04-29 Fault positioning method based on active and passive hybrid measurement and related equipment

Publications (2)

Publication Number Publication Date
CN114900426A CN114900426A (en) 2022-08-12
CN114900426B true CN114900426B (en) 2023-06-06

Family

ID=82719866

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210466554.5A Active CN114900426B (en) 2022-04-29 2022-04-29 Fault positioning method based on active and passive hybrid measurement and related equipment

Country Status (1)

Country Link
CN (1) CN114900426B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH1054371A (en) * 1996-08-12 1998-02-24 Hitachi Constr Mach Co Ltd Trouble diagnostic device for oil hydraulic pump in work machine
US6055851A (en) * 1996-08-12 2000-05-02 Hitachi Construction Machinery Co., Ltd. Apparatus for diagnosing failure of hydraulic pump for work machine
CN205560116U (en) * 2016-05-09 2016-09-07 唐山曹妃甸热力有限公司 Non - digging type underground large diameter pipe restores structure
CN106493271A (en) * 2015-09-07 2017-03-15 刘勇 A kind of accurate gear rolling mechanical processing machine

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant