CN113454950A - Network equipment and link real-time fault detection method and system based on flow statistics - Google Patents

Publication number
CN113454950A
Authority
CN
China
Prior art keywords
baseline data
traffic
network
data set
link
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201980092647.2A
Other languages
Chinese (zh)
Inventor
赵石
林跃华
许辉
佘敦成
王淼
刘辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14Network analysis or design
    • H04L41/142Network analysis or design using statistical or mathematical methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14Network analysis or design
    • H04L41/145Network analysis or design involving simulating, designing, planning or modelling of a network
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00Arrangements for monitoring or testing data switching networks
    • H04L43/08Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/0876Network utilisation, e.g. volume of load or congestion level
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06Management of faults, events, alarms or notifications
    • H04L41/0681Configuration of triggering conditions
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00Arrangements for monitoring or testing data switching networks
    • H04L43/12Network monitoring probes

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Environmental & Geological Engineering (AREA)
  • Physics & Mathematics (AREA)
  • Algebra (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Pure & Applied Mathematics (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

A system and method for real-time fault detection of network devices or network links based on traffic statistics are provided. For device fault detection, a statistical empirical model of network traffic is constructed from a baseline data set, where each baseline data point in the set corresponds to the network traffic accumulated in one interval. For link fault detection, a statistical empirical model of link traffic distribution is constructed from a baseline data set, where each baseline data point in the set corresponds to the link traffic distribution of one interval. In both cases, after the initial build, the model is dynamically updated with new data that is qualified and selected. Each new baseline data point is evaluated against the updated model to determine whether it is an outlier. Consecutive outliers may trigger a fault alarm.

Description

Network equipment and link real-time fault detection method and system based on flow statistics
Technical Field
The disclosed embodiments relate to the field of communication networks, and in particular, to the field of network failure detection mechanisms.
Background
A communication network consists of links and nodes arranged in a certain topology for transporting internet traffic. The nodes include network devices, such as servers, switches, and routers, that are interconnected by links. Existing commercial network fault detection typically relies on user-defined alarms and rule violations based on measured metrics, which requires detailed knowledge of the hardware characteristics, performance, and software elements of the network infrastructure. For a single device or a simple network, fault detection is well understood and easy to implement.
In recent years, however, network architectures have become increasingly complex due to the exponential growth in the number of network devices and links, the large number of device manufacturers, the wide variety of software versions running on the network devices, and the multi-level switching employed in the architecture. Therefore, it is almost impossible to define fault detection rules that cover all possible faults and still respond quickly. The complexity of the network further comes from unobservable interactions between devices. For example, two devices may not be directly connected but still have an indirect path between them. As a result, it is a great challenge to define valid rules that lead to fast and reliable fault detection. In addition, as new devices or new software versions are introduced, static user-defined rules for fault detection may quickly become obsolete.
Disclosure of Invention
Embodiments of the present disclosure are directed to systems and methods for real-time network fault detection in which traffic anomalies for network components are discovered using dynamic statistics of traffic data without the need to identify detailed characteristics and business operations of the monitored components.
In one aspect, the disclosed embodiments provide an anomaly detection mechanism for a single switching device that periodically evaluates network traffic for a device based on dynamically updated statistics, the network traffic corresponding to a difference between ingress traffic and egress traffic for the device.
In particular, for network devices, statistical empirical models of network traffic may be constructed according to a machine learning process. In some embodiments, the model is initially built using network traffic data collected at the device over a plurality of intervals, such as consecutive intervals. For example, each baseline data of the baseline dataset of the model corresponds to the network traffic accumulated for each interval. For example, the model includes a function of the mean and standard deviation of each interval of network traffic. After the initial model is built, for each interval, a determination is made as to whether a new baseline data for network traffic is eligible and is selected for updating the model. If so, the new baseline data replaces the oldest baseline data in the baseline data set, and the model is recalculated. Whether or not the new baseline data is used to update the model, the new baseline data is evaluated against the updated model to determine whether it is an outlier. In response to detecting a preset number of consecutive outliers, an alarm is generated that may trigger further automatic or manual diagnostics, troubleshooting, and repair actions.
Generally, a link group comprises a set of parallel links that share the traffic load between two sides of the group, each side comprising one or more devices. The links are functionally equivalent, and in the absence of a fault the total traffic between the two sides is distributed over the links in stable proportions. If one link fails, the other links automatically take over the traffic load that the failed link cannot carry, and therefore the distribution of traffic among the links (herein, the link traffic distribution) changes. In another aspect of the disclosure, embodiments provide a link anomaly detection mechanism that periodically compares the real-time link traffic distribution in the link group to a dynamically updated statistical empirical model. In some embodiments, the model includes an expected link traffic distribution.
The expected link traffic distribution may be obtained by averaging a baseline data set of link traffic distribution data collected over a plurality of intervals, such as consecutive intervals. For example, each baseline data point of the baseline data set corresponds to the set of traffic fractions carried by the individual links within one interval. After the model is initially built, for each interval, a determination is made as to whether the new baseline data point of link traffic distribution is eligible and is selected for updating the baseline data set. If so, the new baseline data point replaces the oldest baseline data point of the baseline data set to update the expected link traffic distribution. Whether or not the new baseline data point is used to update the model, it is evaluated against the expected distribution to determine whether it is an outlier. In response to detecting a preset number of consecutive outliers, an alarm is generated that may further trigger automatic or manual diagnostics, troubleshooting, and repair actions.
According to embodiments of the present disclosure, since network traffic or link traffic distribution is continuously monitored and evaluated in real time using simple statistical processing, network anomalies of devices or links can be detected quickly, regardless of the complexity of the network architecture. Since the monitored statistics are derived directly from the traffic data, fault detection can be implemented using data already available from the device or link, and the empirical model does not require comprehensive knowledge of the detailed characteristics and business operations.
In addition, because the model is frequently updated as new data is collected, it reflects the latest probability distribution of the data, which improves the effectiveness and accuracy of fault detection. Further, since the model of a device or link is constructed and updated using actual empirical data from the monitored device or link itself, the model is tailored to the characteristics and business operations of that device or link. This further contributes to the accuracy of the fault detection.
The foregoing is a summary and thus contains, by necessity, simplifications, generalizations, and omissions of detail; consequently, those skilled in the art will appreciate that the summary is illustrative only and is not intended to be in any way limiting. Other aspects, inventive features, and advantages of the present disclosure, as defined solely by the claims, will become apparent in the non-limiting detailed description set forth below.
Drawings
The embodiments of the disclosure will be best understood from the following detailed description when read with the accompanying drawings in which like characters represent like elements.
Fig. 1 illustrates an exemplary communication network having a fault detection device capable of detecting device faults and link faults in real time based on statistics of traffic data according to an embodiment of the present disclosure.
FIG. 2 is a flow diagram of an exemplary computer-implemented process for real-time device fault detection based on traffic statistics, according to an embodiment of the present disclosure.
FIG. 3 is a flow diagram of an exemplary computer-implemented process for statistical model building and corresponding fault detection for a device according to an embodiment of the present disclosure.
Fig. 4 illustrates a change in link traffic distribution of an exemplary link after a link failure thereof.
Fig. 5 is a flow diagram of an exemplary computer-implemented process for real-time link failure detection based on traffic statistics, according to an embodiment of the present disclosure.
FIG. 6 is a flow diagram of an exemplary computer-implemented process for statistical model building of links and corresponding fault detection according to an embodiment of the present disclosure.
FIG. 7 is a block diagram of an exemplary computing system for real-time device failure detection and link detection based on traffic statistics, according to an embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to the preferred embodiments of the present disclosure, examples of which are illustrated in the accompanying drawings. While the invention will be described in conjunction with the preferred embodiments, it will be understood that they are not intended to limit the invention to these embodiments. On the contrary, the invention is intended to cover alternatives, modifications and equivalents, which may be included within the scope and spirit of the invention as defined by the appended claims. Furthermore, in the following detailed description of embodiments of the invention, numerous specific details are set forth in order to provide a thorough understanding of the invention. However, it will be understood by those of ordinary skill in the art that the present invention may be practiced without these specific details. In other instances, well-known methods, procedures, components, and links have not been described in detail so as not to unnecessarily obscure aspects of the embodiments of the invention. Although a method may be described as a series of steps for clarity, the step numbering does not necessarily imply a sequence of steps. It should be understood that some steps may be skipped, performed in parallel, or performed without maintaining a strict order. The drawings showing embodiments of the invention are semi-diagrammatic and not to scale; in particular, some of the dimensions are exaggerated in the drawing figures for clarity of presentation. Likewise, although the views in the drawings generally show the same orientation for ease of description, this depiction is for the most part arbitrary. In general, the invention can be implemented in any orientation.
Symbols and terms
It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless otherwise specifically apparent from the following discussion, it should be understood that terms such as "collecting," "constructing," "processing," or "calculating" or "executing" or "storing" or the like, are used throughout the present disclosure to mean: the actions and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories and other computer readable media into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices. When a component appears in several embodiments, the same reference numerals are used to indicate that the component is the same as that shown in the initial embodiment.
Network equipment and link real-time fault detection based on flow statistics
Embodiments of the present disclosure provide mechanisms for detecting network device or link failures based on real-time traffic data and its statistics. For a network device, an empirical statistical model can be constructed using a baseline data set collected over a plurality of intervals, the model representing a probability distribution of the device's network traffic per interval. The model may include a set of statistical indicators or related functions, e.g., the mean and standard deviation. After the initial model is built, a new baseline data point of network traffic for each interval is evaluated against the model to determine whether it is an outlier. Consecutive outliers can trigger a fault alarm. If qualified, the new baseline data point can be randomly selected for updating the model. In this embodiment, the model is updated with the most recent normal data and thus accurately reflects the current characteristics and business operations of the device.
For a group of parallel links, an empirical statistical model is constructed using a baseline data set collected over a plurality of intervals, the model representing a probability distribution of the link traffic distribution within the group for each interval. The model may correspond to an expected link traffic distribution. After the initial model is built, for each interval, a new baseline data point comprising a set of link traffic values or a link traffic distribution is evaluated against the model to determine whether it is an outlier. Consecutive outliers can trigger a fault alarm. If qualified, the baseline data point can be randomly selected for updating the model. In this embodiment, the model is updated with the most recent normal data and thus accurately reflects the current characteristics and business operations of the links.
Fig. 1 illustrates an exemplary communication network 100 having fault detection devices 121 and 122 capable of detecting device faults and link faults in real time based on traffic data statistics in accordance with an embodiment of the present disclosure. In a simplified form, the network 100 includes a plurality of network switching devices (e.g., routers) interconnected and arranged in multiple layers, each configured to forward network traffic. The switching devices belong to a network architecture controlled by the internet service provider 110. A terminal (e.g., 131), which may be a server device or a client device, is coupled to a switching device. It is to be understood that the present disclosure is not limited to any particular type of network topology or switching device.
Each switching device may be configured to collect various forms of traffic data, for example, in compliance with the Simple Network Management Protocol (SNMP). In accordance with the present disclosure, the real-time traffic data can be used to build dynamically updated statistical models for real-time fault detection of devices and links. In an exemplary embodiment, the model building and fault detection functions may be implemented in a separate monitoring device (e.g., device 141 or 142) that is coupled to the monitored device (e.g., switching device 121 or 122). However, in some other embodiments, the fault detection function may be integrated in the switching device.
As shown, during operation, the switching device 122 periodically collects its ingress and egress traffic and provides it to the monitoring device 142. It is assumed that a significant change over a short period of time in the network traffic, which corresponds to the difference between the total ingress traffic and the total egress traffic, can indicate an anomaly or failure of the switching device. The monitoring device 142 constructs a statistical empirical model of network traffic based on the baseline data set provided by the switching device 122. The model represents the probability distribution of network traffic in each interval, and a normal region and an abnormal region are defined according to the model. In some embodiments, the model is as simple as the mean and standard deviation of the baseline data set. However, the present disclosure is not limited to any particular statistical indicator, function, algorithm, or formula associated with network traffic used in the statistical model. For each interval, a new baseline data point of the network traffic is evaluated according to the model to determine whether it falls within the abnormal region. Additionally, a new baseline data point that is eligible can be selected for updating the model. If consecutive outliers are detected, an alarm is generated to trigger subsequent manual or automatic troubleshooting measures.
Further, it is assumed that, in a group of parallel links, a significant change in link traffic distribution over a short period of time may indicate a link anomaly or failure. As shown, switching devices 121 and 123 and the several links 151 between them are configured as one link group. Traffic between the switching devices 121 and 123 is distributed over the links 151 in a specific set of proportions. The switching device 123 periodically collects the total ingress traffic or the total egress traffic of each link and provides it to the monitoring device 141. The monitoring device 141 constructs a statistical empirical model from the baseline data set provided by the switching device 123. The model represents an expected link traffic distribution for the link group. A normal region and an abnormal region are defined according to the model. The present disclosure is not limited to any particular statistical indicator, function, algorithm, or formula associated with the link traffic used in the statistical model. For each interval, a new baseline data point comprising a set of link traffic values or the current link traffic distribution is evaluated against the expected distribution to determine whether it falls within the abnormal region. Additionally, a new baseline data point that is eligible can be selected for updating the model. If consecutive outliers are detected, an alarm is generated to trigger subsequent manual or automatic troubleshooting measures.
According to the embodiments of the present disclosure, since network traffic or link traffic distribution is continuously monitored and evaluated in real time using simple statistical processing, network anomalies of devices or links can be detected quickly even if the network architecture is complex. Since the statistical indicators to be monitored can be derived from the traffic, fault detection can be implemented using readily available data and empirical models of the device or link, without requiring comprehensive knowledge of its detailed characteristics, performance, and service operation.
In addition, because newly collected data is frequently used for updating, the model reflects the latest probability distribution of the data, which noticeably improves the effectiveness and accuracy of fault detection. Further, because the model is constructed and updated using empirical data collected from a particular device or link, the model is specific to the monitored device or link. This further contributes to the accuracy of the fault detection.
Fig. 2 is a flow diagram of an exemplary computer-implemented process 200 for real-time device fault detection based on traffic statistics, according to an embodiment of the present disclosure. Process 200 may be performed by a monitoring device communicatively coupled to the switching device being monitored, or by a monitoring module integrated in the switching device being monitored. At 201, a statistical empirical model of per-interval network traffic is generated from the initial baseline data set. For example, the baseline data set includes network traffic data for N consecutive intervals, e.g., N = 2000 intervals of 1 minute each. The specific numbers herein are exemplary only, and the disclosure is not limited thereto. The interval length and sample size may be selected based on considerations such as data acquisition noise due to various engineering constraints, statistical properties of the traffic distribution, and sufficient representativeness of the probability distribution.
Each baseline data point in the baseline data set is the per-interval network traffic, corresponding to the difference between the total ingress traffic and the total egress traffic accumulated over an interval. The total ingress and egress traffic may be the sums of the traffic through all ingress ports and all egress ports of the device, respectively. The ingress and egress traffic data may be collected in real time by the monitored device and provided to the monitoring device or monitoring module for fault detection purposes.
A normal region and one or more abnormal regions are defined according to the probability distribution of the network traffic data over the N intervals. In one example, real-world network traffic data may follow a normal probability distribution; however, the present disclosure is not limited thereto. In some embodiments, the statistical model relates to a mean and a standard deviation of the baseline data set, and the abnormal region and the normal region can be defined as a function of the mean and the standard deviation, as described in more detail below with respect to fig. 3.
At 202, network traffic data for the device is generated periodically, e.g., every minute, in the same manner as the baseline data set is generated at 201. At 203, the statistical model is updated in real time with new network traffic data while keeping the size of the baseline data set unchanged. At 204, each new network traffic baseline data point is evaluated according to the updated statistical model to determine whether it falls within the abnormal region. At 205, if M outliers occur consecutively, an alarm is generated that may trigger various further operations, such as troubleshooting, diagnostic operations, and the like. For example, M is predefined as 3.
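As an illustration of the per-interval quantity described above, the following minimal Python sketch computes the network traffic of one interval as total ingress minus total egress. The function name `net_traffic`, the port names, and the counter values are hypothetical and are not part of the disclosure.

```python
from typing import Mapping


def net_traffic(ingress_by_port: Mapping[str, float],
                egress_by_port: Mapping[str, float]) -> float:
    """Return total ingress minus total egress accumulated over one interval."""
    total_in = sum(ingress_by_port.values())    # sum over all ingress ports
    total_out = sum(egress_by_port.values())    # sum over all egress ports
    return total_in - total_out


# Hypothetical per-port traffic accumulated over one interval.
ingress = {"port1": 1200.0, "port2": 800.0}
egress = {"port1": 900.0, "port2": 1000.0}
d_i = net_traffic(ingress, egress)  # 2000.0 - 1900.0
```

The units are whatever the device reports, as long as ingress and egress use the same unit.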
FIG. 3 is a flow diagram of an exemplary computer-implemented process 300 for statistical model building and corresponding fault detection for a device according to an embodiment of the present disclosure. At 301, an interval index "i" is set to 1. At 302, the network traffic baseline data point $D_i$ for interval $T_i$ is determined from the real-time ingress traffic and egress traffic accumulated over the interval. At 303, it is determined whether $D_i$ is qualified to serve as a baseline data point for the statistical empirical model. In some embodiments, a baseline data point is qualified if the following conditions are met: (1) the total ingress traffic and the total egress traffic in the interval are both greater than a certain value, for example 1 Mbit/s (megabits per second); and (2) the previous baseline data point $D_{i-1}$ is a normal value, as described below. However, various other qualifying conditions may be employed without departing from the scope of the present disclosure. If the baseline data point is not qualified, the index i is incremented at 311 to evaluate the next baseline data point.
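The two qualifying conditions of step 303 can be sketched as a simple predicate. This is only an illustration under the example threshold given in the text; the function name `is_qualified` and its parameters are hypothetical.

```python
MIN_TRAFFIC_MBPS = 1.0  # example threshold from the text (1 Mbit/s)


def is_qualified(total_in_mbps: float,
                 total_out_mbps: float,
                 previous_was_normal: bool) -> bool:
    """Check the two example conditions for using a data point as baseline."""
    enough_traffic = (total_in_mbps > MIN_TRAFFIC_MBPS
                      and total_out_mbps > MIN_TRAFFIC_MBPS)
    return enough_traffic and previous_was_normal
```

A data point that fails this check is skipped as a baseline candidate and the process moves on to the next interval.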
For a qualified baseline data point, it is determined whether to add it to the baseline data set of the statistical model. In particular, at 304, it is determined whether the current baseline data set contains fewer than 2000 data points. If so, at 305, the new baseline data point $D_i$ is added to the initial baseline data set used to construct the statistical model, for example by taking the mean and standard deviation of the baseline data set. In some embodiments, the mean is calculated as follows:
$$\mathrm{mean} = \mathrm{average}\bigl(\log(D_1), \ldots, \log(D_i), \ldots, \log(D_N)\bigr),$$
where N = 2000; the standard deviation (sd) is calculated as follows:
$$\mathrm{sd} = \mathrm{sd}\bigl(\log(D_1), \ldots, \log(D_i), \ldots, \log(D_N)\bigr).$$
It is understood that various other forms or equations of mean or standard deviation, or other statistical indicators, may be employed without departing from the scope of this disclosure. Once the model is updated with $D_i$ at 305, index i is incremented at 311 to evaluate the next baseline data point.
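The mean and standard deviation of the logarithms of the baseline data can be computed as in the following sketch. It makes several assumptions that the text does not state: the natural logarithm is used, the sample (n-1) standard deviation is used, and every baseline value is strictly positive so that its logarithm exists. The function name `fit_log_model` is hypothetical.

```python
import math
import statistics
from typing import Sequence, Tuple


def fit_log_model(baseline: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sd) of the natural logs of the baseline data points."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline data points")
    if any(d <= 0 for d in baseline):
        # Assumption of this sketch: log() is only defined for positive values.
        raise ValueError("baseline data points must be positive")
    logs = [math.log(d) for d in baseline]
    return statistics.fmean(logs), statistics.stdev(logs)
```

With the example in the text, `baseline` would hold N = 2000 values, but the function works for any length of at least two.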
If the current baseline data set has reached 2000 data points (as determined at 304), it is further determined at 306 whether $D_i$ is an outlier. For example, if $(D_i - \mathrm{mean})/\mathrm{sd} > 3$, then $D_i$ is defined as an outlier. If $D_i$ is not an outlier, $D_i$ is merged into the baseline data set at 307, replacing the oldest baseline data point in the set; the mean and standard deviation of the network traffic are updated accordingly at 305. Once the model is updated with $D_i$ at 305, index i is incremented at 311 to evaluate the next baseline data point.
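The example outlier test at 306 is a one-sided threshold on the standardized deviation. The sketch below takes the value, mean, and standard deviation as plain numbers; whether the value should first be log-transformed to match a model fitted on logarithms is left to the caller and is an assumption of any particular use. The function name `is_outlier` and the guard against a non-positive standard deviation are additions of this sketch.

```python
def is_outlier(value: float, mean: float, sd: float,
               threshold: float = 3.0) -> bool:
    """Return True when (value - mean) / sd exceeds the threshold."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (value - mean) / sd > threshold


# A value 6 standard deviations above the mean is flagged.
flagged = is_outlier(10.0, mean=4.0, sd=1.0)
```

Because the comparison is one-sided, as in the example in the text, values far below the mean are not flagged by this sketch.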
If the current baseline data set has reached 2000 data points and $D_i$ is an outlier (as determined at 306), the outlier is recorded at 308. At 309, it is further determined whether $D_i$ is the third consecutive outlier. If so, meaning that 3 outliers have occurred in succession, a fault alarm is generated at 310. At 311, index i is incremented. The above process 302-312 is repeated for each interval.
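The consecutive-outlier rule of steps 308 to 310 can be sketched as a small counter that resets whenever a normal value is seen. The class name `ConsecutiveOutlierAlarm` is hypothetical, and the choice to keep reporting an alarm while the run of outliers continues beyond M is an assumption of this sketch.

```python
class ConsecutiveOutlierAlarm:
    """Track runs of outliers and report when a run reaches length m."""

    def __init__(self, m: int = 3) -> None:
        self.m = m      # number of consecutive outliers that triggers an alarm
        self.run = 0    # length of the current run of outliers

    def observe(self, outlier: bool) -> bool:
        """Record one interval's result; return True if an alarm is due."""
        self.run = self.run + 1 if outlier else 0
        return self.run >= self.m


alarm = ConsecutiveOutlierAlarm(m=3)
results = [alarm.observe(x)
           for x in [True, True, False, True, True, True, True]]
```

In this example the normal value in the third interval resets the run, so the alarm first fires on the sixth interval.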
In some embodiments, $D_i$ may be randomly selected according to a specified probability, for example 50%. If $D_i$ is selected, the oldest baseline data point in the baseline data set is replaced by $D_i$, and the statistical model is updated accordingly. For example, $D_i$ is incorporated when the mean and standard deviation are recalculated. If the current baseline data set has not yet reached 2000 data points, $D_i$ is added to the baseline data set without replacing any baseline data point, and is used to recalculate the mean and standard deviation.
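Random selection combined with replacement of the oldest data point can be sketched with a fixed-size queue. This is a minimal illustration: the function name `maybe_update` is hypothetical, and applying the same selection probability while the set is still filling is an assumption of this sketch rather than something the text specifies.

```python
import random
from collections import deque
from typing import Deque


def maybe_update(baseline: Deque[float], datum: float,
                 rng: random.Random, p: float = 0.5) -> bool:
    """With probability p, append datum to the baseline; return True if added.

    When the deque was created with maxlen=N and is already full, append()
    discards the oldest entry, so the size of the baseline stays constant.
    """
    if rng.random() >= p:
        return False
    baseline.append(datum)
    return True


# Small window (maxlen=3) instead of 2000, only to keep the example short.
window: Deque[float] = deque([1.0, 2.0, 3.0], maxlen=3)
added = maybe_update(window, 4.0, random.Random(0), p=1.0)
```

After a successful update, the mean and standard deviation would be recomputed from the contents of the window.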
In its basic form, a link group comprises a first side A and a second side B, and several parallel links that are functionally equivalent and together share the traffic load between A and B. There is ingress traffic and egress traffic on each side. According to the present disclosure, any of A-side ingress (A_in), A-side egress (A_out), B-side ingress (B_in), and B-side egress (B_out) traffic may be used to characterize the link group for fault detection purposes. The examples detailed herein are applicable to traffic for any combination of side and direction.
When one link fails, the traffic on that link is likely to drop significantly, and the total traffic between A and B is automatically redistributed over the remaining links. Thus, a significant change in link traffic distribution indicates a link failure. Fig. 4 illustrates a change in link traffic distribution of an exemplary link group after one of its links fails. As shown, under normal operating conditions, the four links 401-404 carry 20%, 30%, 40% and 10%, respectively, of the total traffic, for example the traffic entering side A. When link 401 fails, its share drops to 0% while the shares of the remaining links become 40%, 40% and 20%.
Fig. 5 is a flow diagram of an exemplary computer-implemented process 500 for real-time link failure detection based on traffic statistics, according to an embodiment of the present disclosure. Process 500 may be performed by a monitoring device or a monitoring module within a monitored link group, the monitoring device being communicatively coupled to a switching device of the monitored link group. At 501, a representative statistical empirical model of link traffic distribution is generated from an initial baseline data set. For example, the baseline data set includes link traffic distribution data for N intervals, e.g., N = 100 intervals of 1 minute each. The specific numbers are merely exemplary, and the present disclosure is not limited thereto. The interval length and sample size may be selected based on considerations such as data collection noise due to various engineering constraints, statistical properties of the traffic distribution, and sufficient representativeness of the probability distribution.
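The link traffic distribution in the Fig. 4 example is simply each link's traffic divided by the total. The sketch below reproduces the percentages from the text using hypothetical traffic volumes; the function name `traffic_shares` is not part of the disclosure.

```python
from typing import List, Sequence


def traffic_shares(volumes: Sequence[float]) -> List[float]:
    """Return each link's fraction of the total traffic in one interval."""
    total = sum(volumes)
    if total <= 0:
        raise ValueError("no traffic observed in this interval")
    return [v / total for v in volumes]


# Hypothetical per-link volumes chosen to match the Fig. 4 percentages.
normal = traffic_shares([20.0, 30.0, 40.0, 10.0])       # links 401-404
after_failure = traffic_shares([0.0, 40.0, 40.0, 20.0])  # link 401 failed
```

Comparing `normal` with `after_failure` shows the shift in distribution that the link detection mechanism is designed to notice.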
Each baseline datum within the baseline data set corresponds to the respective traffic fractions of all links in a particular direction (ingress or egress) on one side of the link. Traffic data for each link may be collected within each interval and provided to the monitoring device or monitoring module for fault detection purposes. The model may correspond to an expected link traffic distribution that includes a set of expected link traffic proportions. In some embodiments, the expected proportion of a link may be obtained by averaging the traffic proportion of that link over the baseline data set. A normal zone and one or more abnormal zones may be defined as a function of the expected link traffic distribution.
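As a minimal sketch of the averaging described above, the following Python converts per-link traffic volumes into fractions of the total and averages each link's fraction over a baseline data set. The function names `traffic_fractions` and `expected_distribution` and the sample volumes are illustrative assumptions, not terms or data from the disclosure.

```python
from typing import List, Sequence


def traffic_fractions(volumes: Sequence[float]) -> List[float]:
    """Convert per-link traffic volumes for one interval into fractions of the total."""
    total = sum(volumes)
    if total <= 0:
        raise ValueError("total traffic must be positive to form a distribution")
    return [v / total for v in volumes]


def expected_distribution(baseline: Sequence[Sequence[float]]) -> List[float]:
    """Average each link's fraction over all baseline data (one datum per interval)."""
    if not baseline:
        raise ValueError("baseline data set is empty")
    n_links = len(baseline[0])
    return [sum(datum[j] for datum in baseline) / len(baseline) for j in range(n_links)]


# Two intervals of made-up volumes for four parallel links.
baseline = [traffic_fractions([20, 30, 40, 10]), traffic_fractions([10, 15, 20, 5])]
expected = expected_distribution(baseline)
```

Because every baseline datum sums to 1, the averaged expected distribution also sums to 1, which keeps it directly comparable with the fractions observed in a new interval.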
At 502, link traffic data is collected and link traffic distribution data is generated periodically, e.g., every minute, in the same manner as the baseline data set is generated at 501. At 503, the statistical model is updated in real time using the new link traffic distribution data while keeping the number of data in the baseline data set unchanged. At 504, each new link traffic distribution datum is evaluated according to the updated statistical model to determine whether it falls within an abnormal zone. At 505, if M consecutive outliers appear, an alarm is generated that triggers further operations such as troubleshooting and diagnostic operations. For example, M is predefined to be 3.
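The alarm rule at 505 can be illustrated with a short Python sketch that tracks the length of the current run of outliers. The function name `alarm_intervals` is hypothetical, and raising the alarm only once, at the moment the run length reaches M, is an assumption; the disclosure does not say whether a longer run raises further alarms.

```python
from typing import Iterable, List


def alarm_intervals(outlier_flags: Iterable[bool], m: int = 3) -> List[int]:
    """Return the interval indices at which the M-th consecutive outlier is observed."""
    alarms = []
    run = 0  # length of the current run of consecutive outliers
    for i, is_outlier in enumerate(outlier_flags):
        run = run + 1 if is_outlier else 0
        if run == m:  # assumption: alarm once per run, when it first reaches M
            alarms.append(i)
    return alarms


# One flag per interval: True means the interval was judged an outlier.
flags = [False, True, True, False, True, True, True, True]
alarms = alarm_intervals(flags, m=3)
```

Requiring M consecutive outliers trades a delay of M intervals for fewer alarms caused by a single noisy interval.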
FIG. 6 is a flow diagram of an exemplary computer-implemented process 600 for building a statistical model of links and performing the corresponding fault detection, according to an embodiment of the disclosure. At 601, the interval index "i" is set to 1. At 602, the link traffic distribution baseline data A_i for interval T_i is determined according to the detected real-time traffic accumulated in the interval. For example, A_i includes the side-A ingress traffic of all links within the interval, A_i = (V1_i, V2_i, V3_i, V4_i). In some embodiments, A_i may include ingress link traffic fractions derived from the link traffic, or any other link traffic variable suitable for representing the distribution of link traffic.
At 603, it is determined whether A_i is a qualified baseline datum, e.g., whether the number of functional links that can provide valid traffic data has changed in the last 3 consecutive intervals. If so, a fault alarm is generated at 604.
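One possible reading of the qualification check at 603 is sketched below in Python: the count of links reporting valid traffic data is recorded per interval, and the datum is flagged if that count is not constant over the most recent 3 intervals. The function name `link_count_changed` and the way the counts are stored are assumptions for illustration only.

```python
from typing import Sequence


def link_count_changed(valid_link_counts: Sequence[int], window: int = 3) -> bool:
    """Return True if the number of valid links differs within the last `window` intervals."""
    recent = list(valid_link_counts)[-window:]
    # A single distinct value means the count stayed constant over the window.
    return len(set(recent)) > 1


# Four links report valid data until the final interval, when one stops reporting.
changed = link_count_changed([4, 4, 4, 3])
```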
For a qualified baseline datum, it is then determined whether to add it to the baseline data set of the statistical model. In particular, at 605, it is determined whether the current baseline data set contains fewer than 100 baseline data. If so, the new baseline datum A_i is added to the baseline data set for initial construction of the statistical model, e.g., the expected link traffic distribution is obtained from the baseline data set. At 614, index i is incremented.
In some embodiments, the expected distribution corresponds to an average distribution of the baseline data set. It is understood that various other forms of averages or other statistical indicators may be employed without departing from the scope of this disclosure.
If the baseline data set has reached 100 baseline data, the distance between the current link traffic distribution and the expected link traffic distribution is evaluated at 607, and the result is then used at 608 to determine whether A_i is an outlier. For example, A_i is defined as an outlier if the following expression holds:
Figure BDA0003221360720000111
where Vj_i is the ingress traffic of link j in interval i; V_all is the total ingress traffic through all links;
Figure BDA0003221360720000112
is the expected proportion of link j traffic in each interval according to the model; and X is a preset threshold value.
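The inequality itself is not reproduced in this text, so the Python sketch below assumes one plausible form of the distance: the sum over links of the absolute difference between the observed fraction Vj_i / V_all and the expected fraction of link j, compared against X. The choice of an absolute-difference sum, the function names, and the threshold value 0.2 are all assumptions, not the formula of the disclosure. The sample numbers follow the Fig. 4 example.

```python
from typing import Sequence


def distribution_distance(volumes: Sequence[float], expected: Sequence[float]) -> float:
    """Sum of |Vj_i / V_all - expected_j| over all links (assumed distance measure)."""
    total = sum(volumes)
    if total <= 0:
        raise ValueError("total traffic must be positive")
    return sum(abs(v / total - p) for v, p in zip(volumes, expected))


def is_outlier(volumes: Sequence[float], expected: Sequence[float], threshold: float) -> bool:
    """Flag the interval when the distance exceeds the preset threshold X."""
    return distribution_distance(volumes, expected) > threshold


expected = [0.2, 0.3, 0.4, 0.1]   # normal shares of links 401-404 in Fig. 4
after_failure = [0, 40, 40, 20]   # link 401 carries no traffic after failing
distance = distribution_distance(after_failure, expected)
flagged = is_outlier(after_failure, expected, threshold=0.2)  # 0.2 is a made-up X
```

Under this assumed measure, the Fig. 4 failure yields a distance of about 0.4 (0.2 + 0.1 + 0 + 0.1), while an interval whose shares match the expected distribution yields 0.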
If A_i is an outlier, a record is made at 610. At 611, it is further determined whether A_i is the third consecutively detected outlier. If so, meaning that 3 outliers have occurred in succession, an alarm is generated at 612. At 614, index i is incremented. If A_i is not an outlier, the statistical model is updated at 606 by using A_i to replace the oldest baseline datum in the baseline data set. For example, A_i is incorporated when recalculating the expected link traffic distribution. At 614, index i is incremented. The above-described steps 602-614 are repeated for each interval.
In some embodiments, at 606, A_i is randomly selected according to a predetermined probability, for example, 10%. If A_i is selected, it replaces the oldest baseline datum in the baseline data set, thereby updating the statistical model.
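The fixed-size baseline data set with probabilistic replacement can be sketched in Python with a bounded `collections.deque`, which discards its oldest entry when a new one is appended at full capacity. The function name `maybe_update_baseline` and the injectable `rng` parameter are assumptions introduced here for illustration and testability.

```python
import random
from collections import deque


def maybe_update_baseline(baseline: deque, datum, probability: float = 0.10, rng=random) -> bool:
    """With the given probability, append `datum`, evicting the oldest baseline datum.

    `baseline` must be a deque created with `maxlen` set to the fixed data set size
    (e.g. 100), so that appending at capacity drops the oldest entry automatically.
    """
    if rng.random() < probability:
        baseline.append(datum)
        return True
    return False


# Small capacity of 3 for demonstration; the text uses 100 as an example size.
baseline = deque([1, 2, 3], maxlen=3)
accepted = maybe_update_baseline(baseline, 4, probability=1.0)  # always accepted
```

Sampling only a fraction of the non-outlier data slows the turnover of the baseline data set, so the model reflects a longer history than its fixed size alone would cover.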
FIG. 7 is a block diagram of an exemplary computing system 700 for real-time device failure detection and link failure detection based on traffic statistics, according to an embodiment of the present disclosure. The computing system includes a main processor (CPU) 701, a system memory 702, a graphics processing unit (GPU) 703, an I/O interface 704, a network link 705, an operating system 706, and application software 710 that comprises real-time fault detection modules 720 and 730 and is stored in the memory 702. The system 700 is communicatively coupled to a switching device through a network interface.
When executed by CPU 701 on traffic data originating from switching device 750, the device failure detection module 720 can detect device failures in real time based on traffic statistics, as described in detail with reference to Figs. 1-3. The device failure detection module 720 includes: a network traffic data generation module 721, a baseline data set module 722, a statistical model module 723, and an equipment failure handling module 724.
The network traffic data generation module 721 is configured to calculate the difference between the ingress and egress traffic of the switching device 750 in each interval. The baseline data set module 722 maintains a fixed data volume for the baseline data set by selectively accepting eligible new data and deleting the oldest data. The statistical model module 723 may calculate the mean and standard deviation of the baseline data set, and update these statistical indicators each time the baseline data set is updated with a new baseline datum. The equipment failure handling module 724 may determine whether a new baseline datum is an outlier based on the model, generate an alarm in response to detecting consecutive outliers, and perform various other operations for failure detection, verification, and diagnosis.
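The mean and standard deviation test used for device failure detection can be sketched in Python as follows. The sketch assumes that a datum is the ingress-minus-egress traffic difference for one interval, and that it is an outlier when its distance from the baseline mean, divided by the baseline standard deviation, exceeds a threshold. The function name `is_device_outlier`, the default threshold of 3, and the use of the population standard deviation are assumptions.

```python
import statistics
from typing import Sequence


def is_device_outlier(diff: float, baseline_diffs: Sequence[float], k: float = 3.0) -> bool:
    """Flag `diff` when |diff - mean| / standard deviation exceeds the threshold k."""
    mean = statistics.mean(baseline_diffs)
    std = statistics.pstdev(baseline_diffs)  # population standard deviation (assumed)
    if std == 0:
        # Degenerate baseline with no spread: any deviation from the mean is flagged.
        return diff != mean
    return abs(diff - mean) / std > k


# Made-up ingress-minus-egress differences for five baseline intervals.
baseline_diffs = [1.0, 2.0, 3.0, 2.0, 2.0]
large_gap = is_device_outlier(10.0, baseline_diffs)
small_gap = is_device_outlier(2.5, baseline_diffs)
```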
When executed by the CPU 701 on traffic data from the switching device 750 (or any other type of device in the link), the link failure detection module 730 may detect link failures in real time based on traffic statistics, as described in detail with reference to Figs. 4-6. The link failure detection module 730 includes: a link traffic distribution generation module 731, a baseline data set module 732, a statistical model module 733, and a link failure handling module 734.
The link traffic distribution generation module 731 is configured to calculate the traffic proportion of each link for each interval. The baseline data set module 732 maintains a fixed amount of data for the baseline data set by selectively accepting eligible new data and deleting the oldest data. The statistical model module 733 can calculate an expected link traffic distribution and update the expected distribution when the baseline data set is updated with a new baseline datum. The link failure handling module 734 may determine whether a new baseline datum is an outlier based on the model, generate an alarm in response to detecting consecutive outliers, and perform various other operations for fault detection, verification, and diagnosis.
It will be appreciated by those of ordinary skill in the art that the fault detection modules 720 and 730 may be implemented in any suitable programming language or languages known to those skilled in the art. In some embodiments, a system includes only one of the fault detection modules 720 and 730.
While certain preferred embodiments and methods have been disclosed herein, it will be apparent to those skilled in the art from this disclosure that variations and modifications of these embodiments and methods may be made without departing from the spirit and scope of the invention. It is intended that the invention be limited only to the extent required by the appended claims and the rules and principles of applicable law.

Claims (19)

1. A real-time fault detection method for a network switching device, the method comprising:
determining network traffic baseline data for each of a plurality of intervals;
dynamically updating, according to the network traffic baseline data, a network traffic statistical data set associated with the network traffic of each interval for the switching device;
determining that the network traffic baseline data is an outlier according to the network traffic statistical data set and a predetermined threshold; and
generating a fault alarm in response to a predetermined number of network traffic baseline data being determined to be outliers.
2. The method of claim 1, wherein the ingress traffic corresponds to a sum of ingress traffic through all ingress ports of the switching device; the egress traffic corresponds to a sum of egress traffic through all of the egress ports of the switching device.
3. The method of claim 1, wherein the set of network traffic statistics comprises: mean and standard deviation of network traffic data over the baseline data set;
wherein the determining that the network traffic baseline data is an outlier according to the network traffic statistical data set and the predetermined threshold comprises:
determining that the network traffic baseline data is an outlier based on a distance between the network traffic baseline data and the mean, and further based on a ratio of the distance to the standard deviation.
4. The method of claim 1, further comprising:
determining a network traffic statistics set of a baseline data set; wherein the baseline data set includes network traffic data for a first plurality of contiguous intervals.
5. The method of claim 4, wherein dynamically updating the set of network traffic statistics associated with each interval of network traffic for the switching device based on the network traffic baseline data comprises:
updating the baseline data set by adding network traffic baseline data for the interval to the baseline data set and deleting the oldest network traffic baseline data from the baseline data set; and
recalculating the network traffic statistical data set according to the updated baseline data set.
6. The method of claim 5, wherein said updating said baseline data set comprises:
determining that the network traffic baseline data is qualified baseline data for the update process according to:
the previous network traffic baseline data was not determined to be an outlier; and the ingress traffic and the egress traffic are both greater than a predetermined threshold.
7. A real-time fault detection method for a network link group, the method comprising:
collecting real-time traffic of the network link group in each of a plurality of intervals; wherein the network link group comprises: a first end, a second end and a plurality of functionally equivalent links; wherein the real-time traffic of the interval comprises:
respective link traffic for a plurality of links from the first end to the second end; and
link traffic of the network link group from the first end to the second end;
dynamically updating an expected link traffic distribution for the network link group according to the real-time traffic of the interval;
evaluating the real-time link traffic of the interval according to the expected link traffic distribution;
determining that the real-time traffic of the interval is an outlier according to the deviation; and
generating a fault alarm in response to the real-time traffic of a predetermined number of intervals being determined to be outliers.
8. The method of claim 7, further comprising:
determining a real-time link traffic distribution of the plurality of links, the real-time link traffic distribution corresponding to a proportion of each link's traffic in the interval relative to the link traffic of the network link group;
wherein said evaluating real-time link traffic for said interval based on said expected link traffic distribution comprises:
evaluating the real-time link traffic distribution according to the expected link traffic distribution.
9. The method of claim 7, wherein said evaluating real-time link traffic for said interval based on said expected link traffic distribution comprises:
evaluating the distance between the traffic fraction of each link and the expected traffic fraction of that link.
10. The method of claim 7, further comprising:
determining the expected link traffic distribution according to a baseline data set, wherein the dynamically updating the expected link traffic distribution for the network link group according to the real-time traffic of the interval comprises:
updating the baseline data set by adding real-time traffic for the interval to the baseline data set and deleting the oldest real-time traffic from the baseline data set; and
recalculating the expected link traffic distribution based on the updated baseline data set.
11. The method of claim 10, wherein said updating said baseline data set comprises:
determining that the real-time traffic of the interval is qualified baseline data for updating the baseline data set; and
randomly selecting, according to a selection probability, the real-time traffic of the interval for updating the baseline data set.
12. The method of claim 10, wherein said updating said baseline data set comprises:
determining real-time traffic for the interval as qualifying baseline data for updating the baseline data set according to:
the number of links, among the plurality of links, that provide valid real-time traffic remains unchanged relative to a previous interval.
13. A system, comprising:
a processor; and a memory; the memory is coupled to the processor and stores instructions, wherein the instructions, when executed by the processor, implement a fault detection method for a switching device, wherein the method comprises:
determining network traffic baseline data for each of a plurality of intervals;
dynamically updating, according to the network traffic baseline data, a network traffic statistical data set associated with the network traffic of each interval for the switching device;
determining that the network traffic baseline data is an outlier according to the network traffic statistical data set and a predetermined threshold; and
generating a fault alarm in response to a predetermined number of network traffic baseline data being determined to be outliers.
14. The system of claim 13, wherein the ingress traffic corresponds to a sum of ingress traffic through all of the ingress ports of the switching device; the egress traffic corresponds to a sum of egress traffic through all of the egress ports of the switching device.
15. The system of claim 13, wherein the set of network traffic statistics comprises: mean and standard deviation of network traffic data over the baseline data set;
wherein the determining that the network traffic baseline data is an outlier according to the network traffic statistical data set and the predetermined threshold comprises:
determining that the network traffic baseline data is an outlier based on a distance between the network traffic baseline data and the mean, and further based on a ratio of the distance to the standard deviation.
16. The system of claim 13, wherein the method further comprises:
determining a network traffic statistics set of a baseline data set; wherein the baseline data set includes network traffic data for a first plurality of contiguous intervals.
17. The system of claim 16, wherein dynamically updating the set of network traffic statistics associated with network traffic for each interval for the switching device based on the network traffic baseline data comprises:
updating the baseline data set by adding network traffic baseline data for the interval to the baseline data set and deleting the oldest network traffic baseline data from the baseline data set; and
recalculating the network traffic statistical data set according to the updated baseline data set.
18. The system of claim 17, wherein said updating said baseline data set comprises:
determining that the network traffic baseline data is qualified baseline data; and
randomly selecting, according to a selection probability, network traffic baseline data for updating the baseline data set.
19. The system of claim 17, wherein said updating said baseline data set comprises:
determining that the network traffic baseline data is qualified baseline data according to: the previous network traffic baseline data was not determined to be an outlier; and the ingress traffic and the egress traffic are both greater than a predetermined threshold.
CN201980092647.2A 2019-05-15 2019-05-15 Network equipment and link real-time fault detection method and system based on flow statistics Pending CN113454950A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/087086 WO2020227985A1 (en) 2019-05-15 2019-05-15 Real-time fault detection on network devices and circuits based on traffic volume statistics

Publications (1)

Publication Number Publication Date
CN113454950A true CN113454950A (en) 2021-09-28

Family

ID=73289095

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201980092647.2A Pending CN113454950A (en) 2019-05-15 2019-05-15 Network equipment and link real-time fault detection method and system based on flow statistics

Country Status (2)

Country Link
CN (1) CN113454950A (en)
WO (1) WO2020227985A1 (en)

Also Published As

Publication number Publication date
WO2020227985A1 (en) 2020-11-19

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination