CN116938684B - Network fault diagnosis method and system - Google Patents


Info

Publication number
CN116938684B
Authority
CN
China
Prior art keywords
fault
network
flow index
dial testing
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311206984.4A
Other languages
Chinese (zh)
Other versions
CN116938684A (en)
Inventor
张晓杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Ruifuxin Technology Co ltd
Original Assignee
Beijing Ruifuxin Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Ruifuxin Technology Co ltd filed Critical Beijing Ruifuxin Technology Co ltd
Priority to CN202311206984.4A priority Critical patent/CN116938684B/en
Publication of CN116938684A publication Critical patent/CN116938684A/en
Application granted granted Critical
Publication of CN116938684B publication Critical patent/CN116938684B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H04L41/064 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis involving time analysis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0677 Localisation of faults
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/0876 Network utilisation, e.g. volume of load or congestion level

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Environmental & Geological Engineering (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The present invention relates to the field of network management, and in particular to a network fault diagnosis method and system. The method uses progressive troubleshooting logic to form nested troubleshooting steps, in which the analysis result of each step determines the troubleshooting logic of the next, to obtain the possible causes of a fault. Dial testing tasks corresponding to those possible causes are then constructed, and their execution results and dial testing flow indexes are obtained. Finally, these are combined with the service flow indexes to query an experience-based troubleshooting knowledge base and obtain a fault diagnosis result. Troubleshooting is thereby automated, follows logic closer to that of actual manual troubleshooting, and is more accurate.

Description

Network fault diagnosis method and system
Technical Field
The present disclosure relates to the field of network management, and in particular, to a method and system for diagnosing a network failure.
Background
With the rapid development of information technology, computer networks are widely used across industries, network scale keeps growing, and technologies are continuously updated, all of which poses great challenges for network management and fault diagnosis. Network faults can cause network outages, disrupt the normal operation of critical business systems, and even cause significant economic losses.
Network faults therefore need to be diagnosed.
Disclosure of Invention
In view of this, the present application discloses a network fault diagnosis method. The method comprises the following steps: acquiring a first service flow index of a first network line corresponding to a network fault to be diagnosed, the first service flow index indicating the traffic transmission condition in the first network line; analyzing a fault area and a fault time point of the network fault according to the first service flow index; acquiring the change, before and after the fault time point, in a second service flow index of at least one preset second network line and/or of an overall network, wherein the second network line is associated with the first network line, the overall network is the network to which the first network line belongs, and the second service flow index indicates the traffic transmission condition in the second network line or the overall network; analyzing the fault influence range and possible fault causes of the network fault to be diagnosed according to the change; selecting, according to the fault area, the fault influence range and the possible fault causes, at least one matching first dial testing task from preset dial testing tasks for execution, and acquiring a first execution result and a first dial testing flow index of each first dial testing task, the first dial testing flow index indicating the traffic transmission condition of the first dial testing task; and querying a pre-maintained troubleshooting knowledge base according to the first service flow index, the first execution result and the first dial testing flow index to obtain a matching first fault diagnosis result.
In some embodiments, the first service flow index includes a first inflow packet number and a first outflow packet number, both measured with the destination end of the first network line as the analysis object. Analyzing the fault area of the network fault according to the first service flow index includes: when the first inflow packet number and the first outflow packet number both equal a first preset value, determining the fault area of the network fault to be between the source end of the first network line and a traffic collection point, the traffic collection point being located between the source end and the destination end; when the first inflow packet number is greater than the first preset value and the first outflow packet number equals the first preset value, determining the fault area to be between the traffic collection point and the destination end; and when the first inflow packet number and the first outflow packet number are both greater than the first preset value, determining the fault area to be between the traffic collection point and the source end.
In some embodiments, the second network line comprises at least one of: the source end to other ports of the destination end; a network segment from the source end to the destination end; the source end to destination ends other than the destination end; and source ends other than the source end to the destination end.
In some embodiments, when the first inflow packet number and the first outflow packet number both equal the first preset value, the second service flow index is the number of inflow packets; when the first inflow packet number is greater than the first preset value and the first outflow packet number equals the first preset value, the second service flow index is the number of outflow packets; and when the first inflow packet number and the first outflow packet number are both greater than the first preset value, the second service flow index is the number of new sessions or active sessions.
In some embodiments, selecting, according to the fault area, the fault influence range and the possible fault causes, at least one matching first dial testing task from the preset dial testing tasks for execution includes: determining the dial testing task type of each first dial testing task to be executed according to the fault influence range and the possible fault causes; determining an executor for each type of first dial testing task according to the fault area; and sending each first dial testing task to the corresponding executor for execution.
In some embodiments, the troubleshooting knowledge base includes interconnected service flow index nodes, dial testing task execution result nodes, dial testing flow index nodes, conclusion nodes, root cause nodes and troubleshooting suggestion nodes, all maintained based on historical experience. Querying the pre-maintained troubleshooting knowledge base according to the first service flow index, the first execution result and the first dial testing flow index to obtain the matching first fault diagnosis result includes: querying the troubleshooting knowledge base for a target service flow index node matching the first service flow index, a target dial testing task execution result node corresponding to the first execution result, and a target dial testing flow index node corresponding to the first dial testing flow index; determining, among the conclusion nodes connected to the target service flow index node and the target dial testing task execution result node, the target conclusion node with the largest sum of connection weights; determining, among the root cause nodes connected to the target service flow index node and the target dial testing task execution result node, the target root cause node with the largest sum of connection weights; determining, among the troubleshooting suggestion nodes connected to the target root cause node, the target troubleshooting suggestion node with the largest connection weight; and outputting the first fault diagnosis result according to the target conclusion node, the target root cause node and the target troubleshooting suggestion node.
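The weighted lookup described in this embodiment can be illustrated with a minimal Python sketch. It is only one possible reading under stated assumptions: the knowledge base is modeled as a dictionary of edge weights, and all node names, weight values and the helper names `best_by_weight_sum` and `diagnose` are hypothetical and do not come from the application.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical model: (from_node, to_node) -> connection weight.
Weights = Dict[Tuple[str, str], float]


def best_by_weight_sum(candidates: Iterable[str], evidence: List[str],
                       weights: Weights) -> Optional[str]:
    """Return the candidate whose summed weight to the evidence nodes is largest."""
    best, best_total = None, 0.0
    for candidate in candidates:
        total = sum(weights.get((node, candidate), 0.0) for node in evidence)
        if total > best_total:  # candidates with no connection (sum 0) are skipped
            best, best_total = candidate, total
    return best


def diagnose(evidence: List[str], conclusions: List[str], root_causes: List[str],
             suggestions: List[str], weights: Weights):
    """Pick the conclusion, root cause and suggestion with the largest weights."""
    conclusion = best_by_weight_sum(conclusions, evidence, weights)
    root_cause = best_by_weight_sum(root_causes, evidence, weights)
    suggestion = (best_by_weight_sum(suggestions, [root_cause], weights)
                  if root_cause is not None else None)
    return conclusion, root_cause, suggestion


# Illustrative data only; none of these nodes or weights appear in the application.
weights: Weights = {
    ("metric:inflow=0,outflow=0", "conclusion:path_blocked"): 3.0,
    ("dial:ping_timeout", "conclusion:path_blocked"): 2.0,
    ("metric:inflow=0,outflow=0", "conclusion:server_down"): 1.0,
    ("metric:inflow=0,outflow=0", "cause:missing_route"): 2.0,
    ("dial:ping_timeout", "cause:missing_route"): 2.0,
    ("dial:ping_timeout", "cause:acl_policy"): 3.0,
    ("cause:missing_route", "advice:check_routing_table"): 4.0,
    ("cause:missing_route", "advice:check_acl"): 1.0,
}
result = diagnose(
    evidence=["metric:inflow=0,outflow=0", "dial:ping_timeout"],
    conclusions=["conclusion:path_blocked", "conclusion:server_down"],
    root_causes=["cause:missing_route", "cause:acl_policy"],
    suggestions=["advice:check_routing_table", "advice:check_acl"],
    weights=weights,
)
print(result)
```

Summing weights over all matched evidence nodes means that a conclusion supported by several observations can outrank one supported strongly by a single observation, which is one plausible motivation for the "largest sum" criterion.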
In some embodiments, after the first fault diagnosis result is obtained, the method further comprises: selecting, according to the first fault diagnosis result, at least one matching second dial testing task from the preset dial testing tasks for execution; acquiring a second execution result and a second dial testing flow index of each second dial testing task; querying the troubleshooting knowledge base according to the second execution result and the second dial testing flow index to obtain a matching second fault diagnosis result; outputting the first fault diagnosis result when the second execution result and the second dial testing flow index match the first fault diagnosis result; and performing fault diagnosis again when the second execution result does not match the first fault diagnosis result.
In some embodiments, when the second execution result and the second dial testing flow index match the first fault diagnosis result, the method further includes: increasing the weights between the target conclusion node and each of the target service flow index node, the target dial testing task execution result node and the target dial testing flow index node; and increasing the weights between the target root cause node and each of the target service flow index node, the target dial testing task execution result node and the target dial testing flow index node.
In some embodiments, after the first fault diagnosis result is output according to the target conclusion node, the target root cause node and the target troubleshooting suggestion node, the method further comprises: receiving feedback information on the troubleshooting suggestion in the first fault diagnosis result; when the feedback information is positive, increasing the weight between the target root cause node and the target troubleshooting suggestion node; and when the feedback information is negative, reducing the weight between the target root cause node and the target troubleshooting suggestion node, and/or determining, among the troubleshooting suggestion nodes connected to the target root cause node, a first troubleshooting suggestion node whose weight is second only to that of the target troubleshooting suggestion node, and outputting a troubleshooting suggestion according to the first troubleshooting suggestion node.
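The feedback handling in this embodiment can be sketched as follows. This is a hedged illustration, not the applicant's implementation: the edge-weight dictionary, the fixed adjustment `step`, the floor at zero and the function name `apply_feedback` are all assumptions introduced here.

```python
from typing import Dict, Optional, Tuple

Weights = Dict[Tuple[str, str], float]  # hypothetical (root_cause, suggestion) -> weight


def apply_feedback(weights: Weights, root_cause: str, target: str,
                   positive: bool, step: float = 1.0) -> Optional[str]:
    """Adjust the root-cause-to-suggestion weight from user feedback.

    On negative feedback, also return the next-best suggestion for the same
    root cause (ranked by the weights as they were before the adjustment).
    """
    ranked = sorted(
        (s for (r, s) in weights if r == root_cause),
        key=lambda s: weights[(root_cause, s)],
        reverse=True,
    )
    key = (root_cause, target)
    if positive:
        weights[key] = weights.get(key, 0.0) + step
        return None
    # Assumption: weights are not allowed to drop below zero.
    weights[key] = max(0.0, weights.get(key, 0.0) - step)
    alternatives = [s for s in ranked if s != target]
    return alternatives[0] if alternatives else None


# Illustrative data only.
kb: Weights = {
    ("cause:missing_route", "advice:check_routing_table"): 3.0,
    ("cause:missing_route", "advice:check_acl"): 2.0,
    ("cause:missing_route", "advice:restart_device"): 1.0,
}
fallback = apply_feedback(kb, "cause:missing_route",
                          "advice:check_routing_table", positive=False)
print(fallback, kb[("cause:missing_route", "advice:check_routing_table")])
```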
The application also provides a network fault diagnosis system. The system comprises: a first acquisition unit, configured to acquire a first service flow index of a first network line corresponding to a network fault to be diagnosed, the first service flow index indicating the traffic transmission condition in the first network line; a first analysis unit, configured to analyze the fault area and the fault time point of the network fault according to the first service flow index; a second acquisition unit, configured to acquire the change, before and after the fault time point, in a second service flow index of at least one preset second network line and/or of the overall network, wherein the second network line is associated with the first network line, the overall network is the network to which the first network line belongs, and the second service flow index indicates the traffic transmission condition in the second network line or the overall network; a second analysis unit, configured to analyze the fault influence range and possible fault causes of the network fault to be diagnosed according to the change; a dial testing unit, configured to select, according to the fault area, the fault influence range and the possible fault causes, at least one matching first dial testing task from preset dial testing tasks for execution, and to acquire a first execution result and a first dial testing flow index of each first dial testing task, the first dial testing flow index indicating the traffic transmission condition of the first dial testing task; and a matching unit, configured to query a pre-maintained troubleshooting knowledge base according to the first service flow index, the first execution result and the first dial testing flow index to obtain a matching first fault diagnosis result.
In the solution described in the foregoing embodiments, progressive troubleshooting logic may be used to form nested troubleshooting steps, in which the analysis result of each step determines the troubleshooting logic of the next, to obtain the possible causes of the fault. Dial testing tasks corresponding to those possible causes are then constructed, and their execution results and dial testing flow indexes are obtained. Finally, these are combined with the service flow indexes to query an experience-based troubleshooting knowledge base and obtain a fault diagnosis result. Troubleshooting is thereby automated, follows logic closer to that of actual manual troubleshooting, and is more accurate.
Drawings
The drawings that are required for use in the description of the embodiments or the related art will be briefly described below.
Fig. 1 is a flowchart of a network fault diagnosis method shown in the present application.
Fig. 2 is a flowchart of a method for matching against the troubleshooting knowledge base shown in the present application.
Fig. 3 is a flowchart of the conclusion verification method shown in the present application.
Fig. 4 is a schematic diagram of an enterprise network scenario shown in the present application.
Fig. 5 is a flowchart of a network fault diagnosis method shown in the present application.
Fig. 6 is a schematic flowchart of index analysis performed by the analysis device shown in the present application.
Fig. 7 is a structural diagram of a network fault diagnosis system shown in the present application.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present application as detailed in the accompanying claims.
The terminology used in the present application is for the purpose of describing particular embodiments only and is not intended to limit the present application. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It should further be understood that the word "if," as used herein, may be interpreted as "when," "upon," or "in response to determining," depending on the context.
In some related technologies, common methods for locating network faults mainly include:
Ping and traceroute tests. The ping command tests connectivity and latency between two nodes in the network, and the traceroute command tests the routing path and latency between nodes. Together they can reveal high-latency or unreachable network nodes, from which a fault in the corresponding network segment or device can be inferred.
Traffic analysis. Traffic probe devices deployed in the network collect traffic and performance data for the whole network or part of it, including indexes such as interface traffic utilization, packet loss rate and latency. The overall running condition and performance bottlenecks of the network are then judged by manually analyzing the historical or real-time data. This can reveal performance problems or faults in the network and give network maintenance personnel a direction for locating the problem.
However, locating network faults with ping and traceroute tools or with deployed traffic analysis devices requires highly skilled, experienced network engineers, which limits the scope and speed of manual troubleshooting. In particular, the related art has at least the following drawbacks:
First, reliance on human experience. Judging the network condition and determining the fault cause depends mainly on the extensive experience of network engineers; the degree of automation and intelligence is low and the labor cost is high.
Second, low diagnostic efficiency. Large amounts of network data and events must be processed manually to analyze and judge network problems, so locating the fault cause takes a long time and it is difficult to cope with large numbers of, or frequently occurring, network problems.
Third, the process is not repeatable. Manual network fault location takes a great deal of time and effort, yet the detailed diagnostic process is rarely recorded effectively. Historical experience therefore cannot be reused to make subsequent fault location more efficient, which limits continuous improvement in problem-locating capability.
In view of this, the present application proposes a network fault diagnosis method. The method may use progressive troubleshooting logic to form nested troubleshooting steps, in which the analysis result of each step determines the troubleshooting logic of the next, to obtain the possible causes of the fault. Dial testing tasks corresponding to those possible causes are then constructed, and their execution results and dial testing flow indexes are obtained. Finally, these are combined with the service flow indexes to query an experience-based troubleshooting knowledge base and obtain a fault diagnosis result. Troubleshooting is thereby automated, follows logic closer to that of actual manual troubleshooting, and is more accurate.
The following description of the embodiments is made with reference to the accompanying drawings.
Referring to fig. 1, fig. 1 is a schematic flow chart of a network fault diagnosis method shown in the present application. As shown in FIG. 1, the method may include S102-S112.
S102, obtaining a first service flow index of a first network line corresponding to the network fault to be diagnosed.
The network fault to be diagnosed refers to a fault that has occurred in the network and needs to be diagnosed. The network fault diagnosis method of the present application can diagnose such a fault. The present application does not limit the type of the network fault to be diagnosed. Common manifestations of network faults include packet loss, failure to connect normally, and so on.
The first network line refers to the line on which the network fault to be diagnosed occurs. A network line generally includes a source end and a destination end.
The first service flow index indicates the traffic transmission condition in the first network line, and may include the number of inflow packets, the number of outflow packets, the number of reset packets, the number of new sessions, the number of active sessions, interface traffic utilization, packet loss rate, latency, and so on. The types of first service flow index can be preset according to service requirements and then monitored.
In some embodiments, an analysis device may be deployed at a core location of the network (e.g., a core switch) to collect the traffic of the whole network and derive flow indexes from it, thereby acquiring the first service flow index.
S104, analyzing the fault area and the fault time point of the network fault according to the first service flow index.
The fault area is the area where the fault may be located, obtained by analyzing the first service flow index.
In some embodiments, mapping relationships between matching conditions and analysis results may be preset in the analysis device. The matching conditions are set based on the first service flow index. For example, if the first service flow index includes the number of inflow packets and the number of outflow packets, the matching conditions may include: both numbers are greater than 0; the number of inflow packets is greater than 0 and the number of outflow packets equals 0; both numbers equal 0; and so on. A fault area may be set empirically for each matching condition.
The first service flow index acquired in S102 is matched against each matching condition to find the target matching condition, and the fault area in S104 is then determined as the fault area corresponding to the target matching condition.
In some embodiments, the first service flow index includes a first inflow packet number and a first outflow packet number, both measured with the destination end of the first network line as the analysis object. The destination end may be a server.
In S104, when the first inflow packet number and the first outflow packet number both equal a first preset value, the fault area of the network fault may be determined to be between the source end of the first network line and a traffic collection point, the traffic collection point being located between the source end and the destination end;
when the first inflow packet number is greater than the first preset value and the first outflow packet number equals the first preset value, the fault area is determined to be between the traffic collection point and the destination end;
and when the first inflow packet number and the first outflow packet number are both greater than the first preset value, the fault area is determined to be between the traffic collection point and the source end.
The traffic collection point may be the analysis device.
The first preset value can be set as required, for example to 0, 1 or 5. The judgment logic on the first inflow packet number and the first outflow packet number may be set as the matching conditions, and the corresponding fault areas as the analysis results. The possible fault area can thus be determined for different values of the first service flow index.
For example, suppose the first preset value is 0. If the first inflow packet number and the first outflow packet number both equal 0, traffic from the source end is not reaching the analysis device, which indicates that the fault area may be between the source end and the analysis device. If the first inflow packet number is greater than 0 and the first outflow packet number equals 0, the fault area may be between the analysis device and the destination end. If both are greater than 0, the destination end is receiving packets from the source end and sending responses normally, but the return traffic cannot reach the source end, which indicates that the fault area may be between the analysis device and the source end.
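The three matching conditions above can be written as a small lookup. The following Python sketch is illustrative only; the enum `FaultRegion`, the function name `locate_fault_region` and the extra `UNDETERMINED` fallback are assumptions added here and are not part of the application.

```python
from enum import Enum


class FaultRegion(Enum):
    # Hypothetical labels for the three fault areas named in the text.
    SOURCE_TO_COLLECTION_POINT = "between source end and collection point"
    COLLECTION_POINT_TO_DESTINATION = "between collection point and destination end"
    COLLECTION_POINT_TO_SOURCE = "between collection point and source end (return path)"
    UNDETERMINED = "no matching condition"  # assumption: fallback for other inputs


def locate_fault_region(inflow_packets: int, outflow_packets: int,
                        preset: int = 0) -> FaultRegion:
    """Map the destination-side packet counts to a candidate fault area."""
    if inflow_packets == preset and outflow_packets == preset:
        return FaultRegion.SOURCE_TO_COLLECTION_POINT
    if inflow_packets > preset and outflow_packets == preset:
        return FaultRegion.COLLECTION_POINT_TO_DESTINATION
    if inflow_packets > preset and outflow_packets > preset:
        return FaultRegion.COLLECTION_POINT_TO_SOURCE
    return FaultRegion.UNDETERMINED


print(locate_fault_region(0, 0).value)
print(locate_fault_region(120, 0).value)
print(locate_fault_region(120, 95).value)
```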
The fault time point is the time at which the network fault occurred. In some embodiments, the time at which the first service flow index changes abruptly is generally taken as the fault time point. For example, when the network is normal the number of inflow packets is greater than 0; if it suddenly drops to 0 at some moment, a network fault has occurred, and that moment, or a moment close to it, may be taken as the fault time point.
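A minimal sketch of this detection, assuming the inflow packet number is sampled at regular intervals and that "abrupt change" means the first drop from a positive count to 0. The function name `find_fault_time_point` and the sample values are hypothetical.

```python
from typing import Optional, Sequence


def find_fault_time_point(timestamps: Sequence[int],
                          inflow_packets: Sequence[int]) -> Optional[int]:
    """Return the timestamp of the first sample where inflow drops from >0 to 0."""
    if len(timestamps) != len(inflow_packets):
        raise ValueError("timestamps and inflow_packets must have the same length")
    for i in range(1, len(inflow_packets)):
        if inflow_packets[i - 1] > 0 and inflow_packets[i] == 0:
            return timestamps[i]
    return None  # no abrupt drop found in the window


# Illustrative samples: one reading per minute, as epoch seconds.
fault_at = find_fault_time_point([100, 160, 220, 280], [12, 9, 0, 0])
print(fault_at)
```

Other change detectors (for example, a threshold on the relative drop) would fit the text equally well; the zero-drop rule is used here only because it mirrors the example in the paragraph above.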
S106, obtaining the change condition of at least one preset second network line and/or a second service flow index of the whole network before and after the fault time point.
"Before and after the fault time point" refers to a moment or period before the fault time point and a moment or period after it. The change in the second service flow index can be obtained by comparing the average index values at the two moments or over the two periods. The second service flow index indicates the traffic transmission condition in the second network line or the overall network; for its explanation, refer to the first service flow index.
Analyzing this change yields the fault influence range and the possible fault causes of the network fault to be diagnosed.
For example, if the flow index of a second network line abruptly changes from non-zero to 0 across the fault time point, that line is affected by the network fault. If the index for other ports from the source end to the destination end changes from non-zero to 0 while the index for the source IP -> destination IP network segment remains non-zero, the possible causes are that some device between the source end and the collection point has no route to the destination end, or that some device applies a policy restricting access to the destination end.
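The comparison of average index values can be sketched as below. This is an assumption-laden illustration: the four change labels and the function name `describe_change` are invented here, and both sample windows are assumed to be non-empty.

```python
from statistics import mean
from typing import Sequence


def describe_change(before: Sequence[float], after: Sequence[float]) -> str:
    """Classify how an index changed across the fault time point.

    `before` and `after` hold the index samples from the period before and
    the period after the fault time point (both assumed non-empty).
    """
    before_avg, after_avg = mean(before), mean(after)
    if before_avg > 0 and after_avg == 0:
        return "non-zero -> zero"
    if before_avg > 0 and after_avg > 0:
        return "stays non-zero"
    if before_avg == 0 and after_avg == 0:
        return "stays zero"
    return "zero -> non-zero"


print(describe_change([5, 7, 6], [0, 0, 0]))
print(describe_change([4, 5], [3, 6]))
```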
The second network line is associated with the first network line, meaning that it shares, or is associated with, at least one of the following of the first network line: the source end, the destination port.
In some embodiments, the second network line may include at least one of:
the source end to other ports of the destination end;
a network segment from the source end to the destination end;
the source end to destination ends other than the destination end;
source ends other than the source end to the destination end.
The overall network is the network to which the first network line belongs.
In this way, whether the second network lines and/or the overall network related to the first network line are also faulty can be judged comprehensively, the influence range of the network fault can be determined, and the possible causes can be analyzed from the traffic transmission conditions of the second network lines and/or the overall network.
In some embodiments, when the first inflow packet number and the first outflow packet number both equal the first preset value, the second service flow index is the number of inflow packets;
when the first inflow packet number is greater than the first preset value and the first outflow packet number equals the first preset value, the second service flow index is the number of outflow packets;
and when the first inflow packet number and the first outflow packet number are both greater than the first preset value, the second service flow index is the number of new sessions or active sessions.
For example, when the first inflow packet number and the first outflow packet number are both 0, the fault lies between the source end and the traffic collection point, and the other dimensions can be analyzed through inflow packets, so in this case only the inflow packets of the second network line need to be counted.
When the first inflow packet number is greater than 0 and the first outflow packet number is 0, the fault lies between the traffic collection point and the destination end, and the other dimensions can be analyzed through outflow packets, so in this case only the outflow packets of the second network line need to be counted.
When the first inflow packet number and the first outflow packet number are both greater than 0, the fault lies between the traffic collection point and the source end. Inflow or outflow packet counts can no longer distinguish the other dimensions, so the new-session or active-session index is used instead, and in this case only the number of new or active sessions of the second network line needs to be counted.
In this way, only the service traffic related to the fault area is obtained, which makes the analysis more targeted, reduces the number of indexes to be analyzed, and improves analysis efficiency.
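The choice of second service flow index can be expressed as a short function. The sketch below is illustrative; the function name `select_second_metric` and the string keys it returns are hypothetical.

```python
from typing import Optional


def select_second_metric(first_inflow: int, first_outflow: int,
                         preset: int = 0) -> Optional[str]:
    """Pick which index to collect for the second network lines."""
    if first_inflow == preset and first_outflow == preset:
        return "inflow_packets"          # fault before the collection point
    if first_inflow > preset and first_outflow == preset:
        return "outflow_packets"         # fault between collection point and destination
    if first_inflow > preset and first_outflow > preset:
        return "new_or_active_sessions"  # fault on the return path to the source
    return None                          # assumption: no rule defined for other inputs


print(select_second_metric(0, 0))
print(select_second_metric(40, 0))
print(select_second_metric(40, 38))
```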
S108, analyzing the fault influence range and possible fault reasons of the network fault to be diagnosed according to the change condition.
In some embodiments, matching logic may be set in the analysis device. The matching logic includes a number of matching conditions defined on the second service flow index, together with the analysis result (a fault influence range and possible fault causes) corresponding to each matching condition. Applying this logic to the change in the second service flow index obtained in S106 yields the fault influence range and the possible fault causes.
For example, if the number of inflow packets from the source IP to the destination port of the destination IP is 0 and the number of outflow packets is 0, a fault may have occurred between the source IP and the collection point, and the trend of the number of inflow packets is analyzed to obtain the fault time point A. The change in the number of inflow packets of the second network line before and after point A is then counted:
source end to other ports of the destination end: changes from non-0 to 0;
source end to the network segment of the destination end: remains non-0;
source end to destination ends other than the destination end: remains non-0;
other source ends to the destination end: remains 0;
the overall network: remains non-0.
Matching logic for these conditions can be set in the analysis device, from which it is concluded that the network fault affects access from the source end to the destination end. Because there is no access traffic between other source ends and the destination end, whether that traffic is affected cannot be determined, and traffic data needs to be supplemented by dial testing. The possible causes of the fault can also be analyzed from the above: some device between the source end and the collection point may lack a route to the destination end, or some device may apply a policy that restricts access to the destination end.
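One way to picture such matching logic is a rule table keyed on the per-dimension change pattern. The sketch below is an assumption-laden illustration: the dimension keys, the `RULES` table and the function names are hypothetical, and the single rule encodes only the example just described.

```python
from typing import Dict, Optional, Tuple


def change_label(before: int, after: int) -> str:
    """Describe how a count changed across the fault time point."""
    b = "0" if before == 0 else "non-0"
    a = "0" if after == 0 else "non-0"
    return f"{b}->{a}"


# Hypothetical rule table: change pattern per dimension -> analysis result.
RULES = [
    (
        {
            "src_to_dst_other_ports": "non-0->0",
            "src_to_dst_segment": "non-0->non-0",
            "src_to_other_dst": "non-0->non-0",
            "other_src_to_dst": "0->0",
            "overall": "non-0->non-0",
        },
        {
            "impact": "source end -> destination end",
            "undetermined": "other source ends -> destination end",
            "possible_causes": [
                "a device lacks a route to the destination end",
                "a device policy restricts access to the destination end",
            ],
        },
    ),
]


def analyze(counts: Dict[str, Tuple[int, int]]) -> Optional[dict]:
    """Match (before, after) inflow packet counts per dimension against the rules."""
    labels = {dim: change_label(b, a) for dim, (b, a) in counts.items()}
    for pattern, result in RULES:
        if pattern == labels:
            return result
    return None  # no rule matched
```

A table of this kind keeps conditions and results as data, so new patterns can be added without changing the matching code.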
S110, selecting at least one matched first dial testing task from preset dial testing tasks to execute according to the fault area, the fault influence range and the possible fault reasons, and acquiring a first execution result and a first dial testing flow index of each first dial testing task; the first dial testing flow index is used for indicating the flow transmission condition of the first dial testing task.
The preset dial testing task can be set according to the requirement. For example, TCP dial tests, ICMP dial tests, traceroute dial tests, TCP Traceroute dial tests, HTTP dial tests, HTTPS dial tests, SSH dial tests, telnet dial tests, and the like may be included. The step can select one or more dial testing tasks to execute, and a first execution result and a first dial testing flow index are obtained.
In some embodiments, a mapping relationship between matching conditions and analysis results (dial testing tasks and executors) may be set in the analysis device. The conclusions obtained through S104 and S108 are matched against this mapping to select the corresponding dial testing task and executor, and an instruction for executing the dial testing task is then sent to the executor, so that the executor executes the corresponding dial testing task and returns an execution result.
In some embodiments, a dial testing task type of the first dial testing task to be executed may be determined according to the fault influence range and the possible fault cause; then determining an executor of each type of the first dial testing task according to the fault area; and then, each first dial testing task is sent to a corresponding executor for execution.
In some implementations, the fault influence range and possible fault causes can be used in advance as matching conditions, with the corresponding dial testing task types as matching results, to form mapping relationships. The conclusions obtained through S104 and S108 are matched against these relationships to obtain the mapped dial testing task. An executor is then determined according to the fault area, and an instruction to execute the determined dial testing task is sent to that executor, so that the executor executes the task and returns an execution result.
For example, assume the analysis finds that the fault influence range is source end to destination end, that other source ends to the destination end is undetermined, and that the possible causes are that some device between the source end and the collection point lacks a route to the destination end or applies a policy restricting access to the destination end. Matching the pre-maintained mapping relationship with the fault influence range and possible causes as input yields that a TCP dial test needs to be executed from the source end to the destination end, a TCP dial test needs to be executed from other source ends to the destination end, and a TCP traceroute dial test needs to be executed from the source end to the destination end. Assuming further that the fault area is analyzed as source end to traffic collection point, the source end of the fault area can be determined as the executor of the dial testing tasks. An instruction to execute the dial testing tasks can then be sent to the source end: an address IP1 in the same network segment as the source end is simulated to send a TCP request to the destination end, an address IP2 in a different network segment from the source end is simulated to send a TCP request to the destination end, and the source end performs TCP traceroute detection to the destination port of the destination end. The execution results of these dial testing tasks and the dial testing flow indexes they generate are collected in the analysis device.
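The two-stage selection (task type from influence range and possible cause, executor from fault area) might be sketched as below. All keys, task names and table contents are hypothetical placeholders chosen to mirror the example; in particular, the executor assigned to the second fault area is an assumption, since the text only states that executors are determined by fault area.

```python
from typing import Dict, List, Tuple

# Hypothetical mapping: (influence range, possible cause) -> dial testing task types.
TASKS_BY_FINDING: Dict[Tuple[str, str], List[str]] = {
    ("source_to_destination", "missing_route"): ["tcp", "tcp_traceroute"],
    ("source_to_destination", "access_policy"): ["tcp", "tcp_traceroute"],
    ("other_sources_to_destination", "missing_route"): ["tcp_from_other_segment"],
}

# Hypothetical mapping: fault area -> executor (second entry is an assumption).
EXECUTOR_BY_AREA: Dict[str, str] = {
    "source_to_collection_point": "source_end",
    "collection_point_to_destination": "analysis_device",
}


def plan_dial_tests(ranges: List[str], causes: List[str], fault_area: str) -> List[Tuple[str, str]]:
    """Return de-duplicated (task_type, executor) pairs for the given findings."""
    executor = EXECUTOR_BY_AREA[fault_area]
    planned: List[Tuple[str, str]] = []
    for influence_range in ranges:
        for cause in causes:
            for task in TASKS_BY_FINDING.get((influence_range, cause), []):
                if (task, executor) not in planned:
                    planned.append((task, executor))
    return planned
```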
S112, querying a pre-maintained fault removal knowledge base according to the first service flow index, the first execution result and the first dial testing flow index to obtain a matched first fault diagnosis result.
The fault removal knowledge base is an experience base constructed on a knowledge graph. It contains the connection relationships, built from historical experience, among service flow indexes, execution results, dial testing flow indexes and fault diagnosis results.
The conclusions obtained through S102-S110 can be input into the fault removal knowledge base for matching to obtain the corresponding fault diagnosis result.
The content included in the fault diagnosis result can be set as required. For example, it may include a diagnosis conclusion, a root cause analysis result and a troubleshooting suggestion. The following embodiments describe the matching logic for multiple output results in the knowledge base, which fits the network diagnosis scenario and makes the output more accurate.
Continuing the previous example, assume that the execution result of the TCP dial testing tasks is that neither IP1 nor IP2 can connect to the destination port of the destination end.
The TCP traceroute dial testing task returns the IP addresses of the network nodes traversed from the source end to the destination port of the destination end: IP3, IP4, IP5, IP6, IP7, *, *, and so on. Here each * indicates that the probed network node did not respond; the * entries after IP7 indicate that no network node after IP7 responded, so the forwarding policy after IP7 needs to be examined.
The dial testing flow indexes for these dial testing tasks are: the number of inflow packets from IP1 to the destination end is 0, and the number of inflow packets from IP2 to the destination end is greater than 0.
Inputting the above into the fault removal knowledge base yields the following: the diagnosis conclusion is that the fault affects access from the source IP network segment to the destination IP; the root cause is that this access is restricted by the egress forwarding policy of node IP7 or the ingress forwarding policy of its next-hop node; and the suggestion is to check the forwarding policy configuration of node IP7 and its next-hop node.
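The traceroute reading used in this example (find the last node that answered, then inspect what follows it) can be sketched in a few lines. The sketch assumes unresponsive hops are marked `*`, as in common traceroute output; the function name is hypothetical.

```python
from typing import List, Optional


def last_responding_hop(hops: List[str], no_reply: str = "*") -> Optional[str]:
    """Return the last hop that answered a probe, or None if none answered."""
    responding = [hop for hop in hops if hop != no_reply]
    return responding[-1] if responding else None


# Path from the example: nodes after IP7 never answer.
path = ["IP3", "IP4", "IP5", "IP6", "IP7", "*", "*", "*"]
suspect = last_responding_hop(path)  # "IP7"
```

The forwarding policy of the returned node and of its next hop would then be the first candidates for inspection.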
Through the scheme described in S102-S112, a fault area and a fault time point can first be analyzed from the first service flow index; the fault influence range and possible fault causes are then analyzed from the change in the second service flow index before and after the fault time point; a first dial testing task can then be constructed from the fault area, the fault influence range and the possible fault causes, combined with historical experience, to obtain the dial testing task result and dial testing flow index corresponding to the possible fault causes; finally, the pre-maintained fault removal knowledge base is queried based on the first service flow index, the first execution result and the first dial testing flow index to obtain a matched first fault diagnosis result.
Progressive troubleshooting logic can thus be adopted to form nested troubleshooting: the analysis result of each step shapes the subsequent troubleshooting logic and yields the possible fault causes; a dial testing task corresponding to those causes is then constructed to obtain the dial testing execution result and dial testing flow index; and these, together with the service flow index, are input into the experience-based fault removal knowledge base to obtain the final fault diagnosis result.
Compared with the related art, at least the following technical effects are achieved:
First, the reliance on highly skilled personnel is reduced. The automatic detection, localization and diagnosis of network faults can replace manual operation at least in part, which reduces the dependence on highly skilled network engineers and the working pressure on technicians.
Second, fault handling efficiency is improved. The automated system can analyze a large amount of network data rapidly and accurately, locate the root cause of the problem and complete fault diagnosis more efficiently than manual operation. Network service can be restored quickly, which shortens the service interruption caused by faults and reduces enterprise losses.
Third, historical experience is solidified. The fault removal knowledge base is built on a large number of historical cases and knowledge summaries, can be reused in problem diagnosis, and can be continuously enriched and optimized. Compared with manual operation, it solidifies, accumulates and shares experience and avoids knowledge loss.
Fourth, the method follows real troubleshooting logic more closely, which makes troubleshooting more accurate.
In some embodiments, the TOS field of the data packet sent by the first dial testing task is set to 0010.
The TOS field contains 4 bits, which together indicate the class of service of the packet: 1000 represents minimum delay; 0100 represents maximum throughput; 0010 represents high reliability; 0001 represents minimum cost; and 0000 represents normal service.
In this application, the TOS field of the data packets sent by the dial testing task is set to 0010. On the one hand this helps guarantee transmission reliability; on the other hand it allows the analysis device to identify dial testing traffic and thus to analyze the dial testing flow index.
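As a hedged illustration of how a dial testing program might mark its packets, the sketch below packs the 4-bit TOS value into the 8-bit IPv4 TOS octet (3 precedence bits, 4 TOS bits, 1 reserved bit, following the RFC 1349 layout) and applies it with the standard `IP_TOS` socket option. The function names are hypothetical, and the socket helper assumes a platform that honours `IP_TOS` (for example Linux).

```python
import socket


def tos_octet(tos_bits: int, precedence: int = 0) -> int:
    """Pack 3-bit precedence and the 4-bit TOS field into the IPv4 TOS octet."""
    if not 0 <= tos_bits <= 0b1111:
        raise ValueError("tos_bits must fit in 4 bits")
    if not 0 <= precedence <= 0b111:
        raise ValueError("precedence must fit in 3 bits")
    return (precedence << 5) | (tos_bits << 1)


def is_dial_test_octet(octet: int) -> bool:
    """True if the TOS bits of a captured octet equal 0010 (high reliability)."""
    return (octet >> 1) & 0b1111 == 0b0010


def open_marked_tcp_socket(tos_bits: int = 0b0010) -> socket.socket:
    """Create a TCP socket whose outgoing packets carry the given TOS bits."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, tos_octet(tos_bits))
    return sock
```

With zero precedence, the TOS value 0010 corresponds to the octet value 0x04, which is what a capture on the analysis side would see.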
In multi-layer knowledge graph matching scenarios, a link-type matching mode is often adopted. For example, B is matched by A, C by A and B, D by A, B and C, and E by A, B, C and D. Link-type matching does not fit the network fault diagnosis scenario; its matching accuracy there is low, which affects the fault diagnosis result.
In the network fault diagnosis scenario, each matching condition is associated with each matching result, but the matching results are not linearly related to one another, and the matching conditions need not be related to one another either. For example, the service flow index (A), the dial testing task execution result (B) and the dial testing flow index (C) need not be related to one another; all three are related to the diagnosis conclusion (D) and the root cause analysis (E), but the diagnosis conclusion and the root cause analysis are not necessarily related to each other. Link-type matching therefore performs poorly.
In some embodiments, among the target nodes connected with the service flow index, the dial testing task execution result and the dial testing flow index, the node with the largest connection weight can be selected as the matching result. This fits the network fault diagnosis scenario, improves matching accuracy, and improves the accuracy of the fault diagnosis result.
Referring to fig. 2, fig. 2 is a flow diagram of a matching method for the fault removal knowledge base shown in the present application. The method illustrated in fig. 2 describes some embodiments of S112. As shown in fig. 2, the matching method may include S202-S210.
The fault removal knowledge base comprises service flow index nodes, dial testing task execution result nodes, dial testing flow index nodes, conclusion nodes, root cause nodes and troubleshooting suggestion nodes, which are maintained based on historical experience and connected to one another.
In some implementations, service flow indexes, dial testing task execution results, dial testing flow indexes, diagnosis conclusions, root causes and troubleshooting suggestions can be obtained from historical experience; in the related art this content is usually stored in relational tables. In this application, corresponding nodes can be formed based on the associations expressed in those relational tables, connections can be formed between the nodes, and each connection can be given an initial weight, thereby forming the fault removal knowledge base. The fault removal knowledge base can also be updated online or offline during use, so as to refresh its knowledge and improve the fault diagnosis effect.
S202, querying, in the fault removal knowledge base, a target service flow index node matching the first service flow index, a target dial testing task execution result node corresponding to the first execution result and a target dial testing flow index node corresponding to the first dial testing flow index.
In this step, a query statement may be constructed based on the content obtained by the analysis in S102-S110, and a corresponding node may be queried.
S204, determining, among the conclusion nodes connected with the target service flow index node, the target dial testing task execution result node and the target dial testing flow index node, the target conclusion node with the largest sum of connection weights.
The conclusion node stores the diagnosis conclusion. The diagnosis conclusion includes the influence range of the fault.
In this step, a query statement may be constructed using the target service flow index node, the target dial testing task execution result node and the target dial testing flow index node found in S202 as query conditions, to obtain all conclusion nodes connected with the three nodes. The weights between each conclusion node and the three target nodes are then determined, and the conclusion node with the largest weight sum is taken as the target conclusion node, thereby obtaining the influence range of the fault.
S206, determining, among the root cause nodes connected with the target service flow index node, the target dial testing task execution result node and the target dial testing flow index node, the target root cause node with the largest sum of connection weights.
The root cause node stores the root cause analysis conclusion. The root cause analysis conclusion includes the root cause of the fault.
In this step, a query statement may be constructed using the target service flow index node, the target dial testing task execution result node and the target dial testing flow index node found in S202 as query conditions, to obtain all root cause nodes connected with the three nodes. The weights between each root cause node and the three target nodes are then determined, and the root cause node with the largest weight sum is taken as the target root cause node, thereby obtaining the root cause of the fault.
S208, determining, among the troubleshooting suggestion nodes connected with the target root cause node, the target troubleshooting suggestion node with the largest connection weight.
The root cause node and the troubleshooting suggestion node are in a direct deduction relationship, so once the root cause is obtained, a query statement can be constructed to determine, among the troubleshooting suggestion nodes connected with the target root cause node, the one with the largest connection weight.
S210, outputting the first fault diagnosis result according to the target conclusion node, the target root cause node and the target troubleshooting suggestion node.
The corresponding content, namely the fault influence range, the root cause and the troubleshooting suggestion, is obtained from the corresponding nodes, and the first fault diagnosis result is then generated according to a preset template. The user thus directly obtains the diagnosis result for the network fault, and automatic diagnosis is realized without the participation of professional staff.
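The selection in S204-S210 can be sketched over a small in-memory weighted graph. This is a simplified stand-in for a knowledge graph query, under stated assumptions: edges are stored in a plain dictionary, a candidate must be connected to every input node to qualify, and all names (`best_connected`, `diagnose`, the node labels) are hypothetical.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Edge = Tuple[str, str]  # (from_node, to_node)


def best_connected(weights: Dict[Edge, float], inputs: List[str],
                   candidates: Iterable[str]) -> Optional[str]:
    """Among candidates connected to all inputs, pick the largest weight sum."""
    best, best_sum = None, float("-inf")
    for cand in candidates:
        if not all((src, cand) in weights for src in inputs):
            continue  # assumption: a candidate must connect to every input node
        total = sum(weights[(src, cand)] for src in inputs)
        if total > best_sum:
            best, best_sum = cand, total
    return best


def diagnose(weights: Dict[Edge, float], inputs: List[str], conclusions: List[str],
             root_causes: List[str], suggestions: List[str]) -> Dict[str, Optional[str]]:
    conclusion = best_connected(weights, inputs, conclusions)      # S204
    root_cause = best_connected(weights, inputs, root_causes)      # S206
    suggestion = (best_connected(weights, [root_cause], suggestions)
                  if root_cause is not None else None)             # S208
    return {"conclusion": conclusion, "root_cause": root_cause,
            "suggestion": suggestion}                              # S210


GRAPH = {
    ("flow", "C1"): 0.5, ("exec", "C1"): 0.4, ("dial", "C1"): 0.3,
    ("flow", "C2"): 0.9, ("exec", "C2"): 0.1, ("dial", "C2"): 0.1,
    ("flow", "R1"): 0.2, ("exec", "R1"): 0.2, ("dial", "R1"): 0.2,
    ("flow", "R2"): 0.3, ("exec", "R2"): 0.3, ("dial", "R2"): 0.3,
    ("R1", "S1"): 0.9, ("R2", "S1"): 0.4, ("R2", "S2"): 0.7,
}
result = diagnose(GRAPH, ["flow", "exec", "dial"], ["C1", "C2"], ["R1", "R2"], ["S1", "S2"])
```

Note that the conclusion and the root cause are chosen independently from the same three inputs, while the suggestion is chosen from the root cause alone, which reflects the non-linear relationship described earlier.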
In some embodiments, the fault removal knowledge base may be updated in real time. After the first fault diagnosis result is obtained, a corresponding dial testing task may be constructed based on it to verify the conclusion, and the diagnosis result is output only if verification passes, which improves fault diagnosis accuracy.
Referring to fig. 3, fig. 3 is a flow chart of the conclusion verification method shown in the present application. The method as shown in fig. 3 may include S302-S310.
S302, selecting at least one matched second dial testing task from the preset dial testing tasks to execute according to the first fault diagnosis result.
The first fault diagnosis result comprises a diagnosis conclusion and a fault cause, and the diagnosis conclusion comprises a fault influence range. A matching condition can therefore be constructed based on the first fault diagnosis result, and at least one matched second dial testing task can be obtained from the dial testing task mapping relationship and sent to an executor for execution. For the specific flow, refer to the description of matching dial testing tasks in S110.
S304, a second execution result and a second dial testing flow index of each second dial testing task are obtained.
The execution result can be obtained through the feedback of the executor, and the second dial testing flow index can be obtained through the monitoring of the flow monitoring equipment.
S306, querying the fault removal knowledge base according to the second execution result and the second dial testing flow index to obtain a matched second fault diagnosis result.
The step may be performed by referring to the process illustrated in S112 and related embodiments thereof, to obtain the second fault diagnosis result.
S308, outputting the first fault diagnosis result under the condition that the second execution result and the second dial testing flow index are matched with the first fault diagnosis result.
When the obtained second failure diagnosis result matches the first failure diagnosis result, the first failure diagnosis result can be outputted, indicating that the previous diagnosis is correct.
S310, performing fault diagnosis again in the case that the second execution result does not match the first fault diagnosis result.
When the second fault diagnosis result is inconsistent with the first fault diagnosis result, the previous diagnosis is indicated to be incorrect. Fault diagnosis may then be performed again through S102-S112 based on the latest fault removal knowledge base, followed by conclusion verification and correction.
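The verification flow of S302-S310 might be outlined as follows. The two callables stand in for the dial testing execution and the knowledge base query; they, and the function name `verify_conclusion`, are hypothetical placeholders rather than interfaces defined by the application.

```python
from typing import Any, Callable, Optional, Tuple


def verify_conclusion(
    first_result: str,
    run_second_dial_tests: Callable[[str], Tuple[Any, Any]],
    query_knowledge_base: Callable[[Any, Any], Optional[str]],
) -> Tuple[bool, Optional[str]]:
    """Return (verified, result); on failure the caller re-runs the diagnosis."""
    # S302-S304: run the second dial testing tasks chosen from the first result.
    execution_result, dial_index = run_second_dial_tests(first_result)
    # S306: query the knowledge base with the second results.
    second_result = query_knowledge_base(execution_result, dial_index)
    # S308: output the first result only if the second diagnosis agrees with it.
    if second_result == first_result:
        return True, first_result
    # S310: mismatch, so diagnosis must be performed again.
    return False, None
```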
In some embodiments, when the second execution result and the second dial testing flow index match the first fault diagnosis result, the first fault diagnosis result is indicated to be correct, and the weights between the relevant nodes in the fault removal knowledge base may be adjusted to strengthen their connection, so that a similar first fault diagnosis result can still be output subsequently.
Specifically, the weights between the target conclusion node and each of the target service flow index node, the target dial testing task execution result node and the target dial testing flow index node may be increased;
and the weights between the target root cause node and each of the target service flow index node, the target dial testing task execution result node and the target dial testing flow index node may be increased.
For example, a new weight may be obtained by adding a value to the original weight. The added value has an inverse relationship with the original weight: the larger the original weight, the smaller the added value, and the smaller the original weight, the larger the added value. In this way the weight can be adjusted quickly without becoming excessively large.
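The text only requires that the increment shrink as the original weight grows; the exact formula is not given. One possible choice, shown here purely as an assumption, is an increment proportional to the remaining distance to an upper bound $w_{max}$, that is $w' = w + r\,(w_{max} - w)$. The names `reinforce`, `rate` and `w_max` are hypothetical.

```python
def reinforce(weight: float, rate: float = 0.5, w_max: float = 1.0) -> float:
    """Increase a connection weight; larger weights receive smaller increments.

    Assumed form: increment = rate * (w_max - weight), so the weight
    approaches w_max without exceeding it.
    """
    if not 0.0 <= weight <= w_max:
        raise ValueError("weight must lie in [0, w_max]")
    if not 0.0 < rate <= 1.0:
        raise ValueError("rate must lie in (0, 1]")
    return weight + rate * (w_max - weight)
```

Under this form a weight of 0.0 jumps to 0.5 in one step, while a weight of 0.5 only moves to 0.75, so low weights are corrected quickly and high weights stay bounded.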
In some embodiments, troubleshooting suggestions may continue to be pushed according to the user's feedback on them, and the fault removal knowledge base may be updated so that effective troubleshooting suggestions are output preferentially later.
Specifically, feedback information for the troubleshooting suggestion in the first fault diagnosis result is received.
In the case that the feedback information is positive, the weight between the target root cause node and the target troubleshooting suggestion node is increased.
In the case that the feedback information is negative, the weight between the target root cause node and the target troubleshooting suggestion node is reduced, and/or, among the troubleshooting suggestion nodes connected with the target root cause node, a first troubleshooting suggestion node whose weight is second only to that of the target troubleshooting suggestion node is determined, and a troubleshooting suggestion is output according to the first troubleshooting suggestion node.
For example, a window may be provided in which the user can enter feedback information. If the user feedback is positive, the previously output troubleshooting suggestion is indicated to be appropriate, and the weight between the target root cause node and the target troubleshooting suggestion node can be increased so that the reasonable suggestion can still be pushed next time. If the user feedback is negative, the previously output troubleshooting suggestion is indicated to be unreasonable, and the weight between the target root cause node and the target troubleshooting suggestion node can be reduced. The second-ranked troubleshooting suggestion may also be output. Updating the weight between the troubleshooting suggestion node and the root cause node according to user feedback allows the correct troubleshooting suggestion to be output preferentially in subsequent diagnoses.
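The feedback handling described above can be sketched as a small update function. The fixed step `delta`, the dictionary layout (suggestion label to weight, for a single root cause node) and the function name are assumptions made for illustration.

```python
from typing import Dict, Optional


def apply_feedback(weights: Dict[str, float], current: str, positive: bool,
                   delta: float = 0.1) -> Optional[str]:
    """Update the weight of `current` in place; return the next suggestion to push.

    Positive feedback raises the weight and returns None. Negative feedback
    lowers the weight and returns the suggestion ranked just below `current`
    (ranking taken before the update), or None if there is none.
    """
    if positive:
        weights[current] += delta
        return None
    ranked = sorted(weights, key=lambda name: weights[name], reverse=True)
    weights[current] = max(0.0, weights[current] - delta)
    position = ranked.index(current)
    return ranked[position + 1] if position + 1 < len(ranked) else None
```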
The following description of the fault diagnosis embodiments is made in connection with an enterprise network scenario. Referring to fig. 4, fig. 4 is a schematic view of an enterprise network scenario illustrated in the present application. As shown in fig. 4, a user terminal 410, a core switch 420, an analysis device 430 connected to the core switch 420, and a server 440 may be included.
The user terminal 410 and the server 440 communicate through the core switch 420. The analysis device 430 may mirror the traffic passing through the core switch 420 to form the service flow index, the dial testing flow index and the like for fault diagnosis. The analysis device 430 also has a built-in fault removal knowledge base.
The analysis device 430 may analyze traffic indexes in the first network link (may be abbreviated as a faulty network) where the fault occurs, find a fault time point, a fault area, a fault influence range, and a possible cause of the fault, and issue different dial testing tasks according to different results of the indexes.
The analysis device 430 may analyze the fault in three steps. In the first step, it queries the service flow index of the faulty network, analyzes the trend of the index to find the fault time point, and analyzes the index value to find the fault area, for example between the service source address and the traffic collection point, or between the traffic collection point and the service destination address.
In the second step, the analysis device 430 queries the traffic index values of other dimensions before and after the fault time point, and obtains the fault influence range by comparing the changes of the index values. Other dimensions include: source IP to destination IP other ports, source IP to destination IP network segment, source IP to other IPs, other IPs to destination IP, and overall network. The service indexes of the dimensions can be used for analyzing the fault influence range and possible reasons of faults. For example, the fault impact range is from source end to destination end, and the possible reasons for the fault are physical link problem, routing table problem, ARP entry problem, forwarding policy problem, etc.
In the third step, the analysis device 430 issues different dial testing tasks according to the analysis result of the service flow index. The candidate executors of a dial testing task are the analysis device and the service source IP, and one or both may be selected. The dial testing task types include TCP dial test, ICMP dial test, traceroute dial test, TCP traceroute dial test, HTTP dial test, HTTPS dial test, SSH dial test, Telnet dial test and the like, and one or more may be selected. The dial testing targets include the service destination IP, other service IPs and key network nodes, and one or more may be selected. Dial testing software (a dial testing program) for performing dial testing tasks may be installed in the user terminal 410 and the analysis device 430. The analysis device 430 communicates with the user terminal 410 over the network and issues dial testing tasks to the dial testing program of the user terminal 410 through a private protocol. After the dial testing program executes a dial testing task, it returns the result to the analysis device. The traffic generated by performing the dial testing task is collected by the analysis device 430 to form the dial testing flow index.
In the application, the TOS field value of the data packets sent by the dial testing task is set to 0010, which helps ensure transmission reliability and allows the analysis device to identify dial testing traffic. After the dial testing program performs the dial testing task and returns the result, the analysis device 430 reads the task result and the flow index generated during dial testing.
The analysis device 430 matches the service flow index, the dial testing task result, and the dial testing flow index with the built-in fault clearing knowledge base to obtain diagnosis results such as fault reasons, fault influence ranges (fault conclusions), troubleshooting suggestions, and the like.
After the analysis device 430 obtains the diagnosis result, it issues a dial testing task to verify the troubleshooting conclusion. If the verification passes, the troubleshooting result is output; otherwise, the analysis device analyzes the fault again.
Referring to fig. 5, fig. 5 is a schematic flow chart of a network fault diagnosis method shown in the present application. As shown in fig. 5, the method may include S501-S509. The method may be applied to the analysis device 430.
S501, after a network fails, source IP, destination IP and destination port information of the network failure can be obtained.
This information may be entered by a user, or read from the fault network information recorded after the fault.
S502, extracting first business flow indexes of each dimension and carrying out first-stage analysis.
Referring to fig. 6, fig. 6 is a schematic flow chart of an analysis index of the analysis device shown in the present application. As shown in fig. 6, the flow may include S601-S606.
S601, acquiring the following first traffic indexes from a source IP to a destination IP destination port: new session number, active session number, reset packet number, inflow packet number, and outflow packet number.
S602, if the number of new sessions is greater than 0 or the number of active sessions is greater than 0, network connectivity from the source IP to the destination port of the destination IP is normal; the fault may be caused by the application, and the analysis device analyzes it accordingly.
S603, if the number of reset packets is greater than 0, network connectivity from the source IP to the destination port of the destination IP is normal; the fault may be caused by the service port not being open, and the analysis device analyzes it accordingly.
S604, analyzing the number of inflow packets, the number of outflow packets and the index change trend to obtain a fault area and a fault time point.
If the number of incoming packets is equal to 0 and the number of outgoing packets is equal to 0, then a failure occurs between the source IP and the analysis device location. If the number of incoming packets is greater than 0 and the number of outgoing packets is equal to 0, then a failure occurs between the analysis device and the traffic destination IP. If the number of incoming packets is greater than 0 and the number of outgoing packets is greater than 0, then a failure occurs between the analysis device and the source IP.
The step of determining the failure time point may refer to S104.
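The decision order of S602-S604 can be summarized in a short function. The returned strings and the function name `first_stage_triage` are illustrative assumptions; determining the fault time point is outside this sketch.

```python
def first_stage_triage(new_sessions: int, active_sessions: int, reset_packets: int,
                       inflow: int, outflow: int) -> str:
    """Classify the fault from the first service flow indexes (S602-S604)."""
    if new_sessions > 0 or active_sessions > 0:
        # S602: sessions exist, so the network path itself works.
        return "connectivity normal: possible application fault"
    if reset_packets > 0:
        # S603: resets come back, so the path works but the port may be closed.
        return "connectivity normal: service port possibly not open"
    # S604: locate the fault area from the packet counts.
    if inflow == 0 and outflow == 0:
        return "fault between source IP and analysis device"
    if inflow > 0 and outflow == 0:
        return "fault between analysis device and destination IP"
    if inflow > 0 and outflow > 0:
        return "fault between analysis device and source IP"
    return "undetermined"
```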
S605, inquiring second service flow indexes before and after the fault time point.
The second service flow index includes the number of inflow packets, the number of outflow packets, or the number of active sessions, for each of the following dimensions: source IP to other ports of the destination IP, source IP to the destination IP network segment, source IP to other IPs, other IPs to the destination IP, and the overall network. Which index is used may be determined according to the fault area identified in S604. This step may refer to S106.
S606, combining the second service flow index to obtain the fault influence range and possible fault causes, completing the first-stage analysis.
This step can refer to S108.
The analysis of the first stage can be completed by S601-S606. A second stage of analysis may then be performed.
S503, specifying the task type and dial testing target of the first dial testing task based on the analysis result of the first stage, and issuing the task to a specified dial testing executor.
S504, acquiring a task execution result of the first dial testing task, capturing a first dial testing flow index of the first dial testing task, and combining the first business flow index to form input of a fault removal knowledge base.
S503 and S504 may refer to S110.
S505, the task execution result, the first dial testing flow index and the first service flow index can be checked, S506 is executed if the check is passed, S503 can be executed if the check is not passed, and the dial testing task is determined again.
The verification in this step may include integrity verification, matching verification, and the like. Specifically, the integrity check includes that the check is failed if any of the task execution result, the first dial testing flow index and the first service flow index is absent. The matching verification can be performed manually, the link can be manually intervened, and if the task execution result, the first dial measurement flow index and the conclusion indicated by the first service flow index are possibly left, the verification is not passed.
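The check in S505 can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `check_inputs`, the use of mappings for the three inputs, and the optional `conclusions_consistent` callback (standing in for the manual matching check) are all hypothetical.

```python
from typing import Callable, Mapping, Optional

Evidence = Optional[Mapping[str, object]]


def check_inputs(
    task_result: Evidence,
    dial_index: Evidence,
    service_index: Evidence,
    conclusions_consistent: Optional[Callable[[Mapping, Mapping, Mapping], bool]] = None,
) -> str:
    """Return the next step: "S506" if the check passes, otherwise "S503"."""
    # Integrity check: every input must be present and non-empty.
    if any(not item for item in (task_result, dial_index, service_index)):
        return "S503"
    # Matching check: delegated to a reviewer-supplied callback (hypothetical).
    if conclusions_consistent is not None and not conclusions_consistent(
        task_result, dial_index, service_index
    ):
        return "S503"
    return "S506"
```

Returning the label of the next step keeps the sketch close to the flow in the text, where a failed check leads back to S503.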
S506, matching the first service flow index, the dial testing task execution result, and the first dial testing flow index against the built-in troubleshooting knowledge base to obtain a fault conclusion, a root cause analysis, and troubleshooting suggestions.
For this step, refer to S112 and the related embodiments.
S507, the analysis device issues a dial testing task, according to the troubleshooting conclusion, to verify that conclusion.
S508, if the verification passes, S509 is executed to output the troubleshooting result; otherwise, the process returns to S502 to perform the diagnosis again.
For S507 and S508, refer to S302-S310.
For example, assume that the first service flow index from the source IP to the destination port of the destination IP includes the number of inflow packets and the number of outflow packets, and that both are equal to 0. Through S502, it may then be determined that the fault occurs between the source IP and the collection point, and the fault time point A is obtained by analyzing the trend of the number of inflow packets. Assume that the changes of the second service flow index (number of inflow packets) before and after point A are as follows: source IP to other ports of the destination IP: non-0 to 0; source IP to the destination IP network segment: non-0 to non-0; source IP to other IPs: non-0 to non-0; other IPs to the destination IP: 0 to 0; overall network: non-0 to non-0.
Combining the first service flow index and the second service flow index shows that the fault affects access from the source IP to the destination IP. Other IPs have no access traffic to the destination IP (0 both before and after point A), so whether they are affected cannot be determined from the service traffic, and the traffic data must be supplemented by dial testing. Possible causes of the fault: a device between the source and the collection point may lack a route to the destination IP, or a device may apply a policy that restricts access to the destination IP.
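The reading of the before/after changes in this example can be sketched as follows. This is an interpretation rather than a rule stated in the application: the labels "affected", "unaffected", and "unknown", the function `classify_change`, and the stand-in value `NON_ZERO` (representing any non-zero count) are all assumptions made for illustration.

```python
def classify_change(before: int, after: int) -> str:
    """Classify one network line by its inflow packet count around the fault time point."""
    if before > 0 and after == 0:
        return "affected"      # traffic existed and then stopped
    if before == 0 and after == 0:
        return "unknown"       # no traffic at all, so dial testing is needed
    return "unaffected"        # traffic is still present after the fault time point


NON_ZERO = 1  # placeholder for "non-0"; the text gives no concrete counts

changes = {
    "source IP -> other ports of destination IP": (NON_ZERO, 0),
    "source IP -> destination IP network segment": (NON_ZERO, NON_ZERO),
    "source IP -> other IPs": (NON_ZERO, NON_ZERO),
    "other IPs -> destination IP": (0, 0),
    "overall network": (NON_ZERO, NON_ZERO),
}

impact = {line: classify_change(before, after) for line, (before, after) in changes.items()}
needs_dial_test = [line for line, status in impact.items() if status == "unknown"]
```

Under these assumptions, `needs_dial_test` contains only the line from other IPs to the destination IP, which is the line the example supplements by dial testing.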
Based on the above assumptions, in S503 a TCP dial testing task may be issued to the dial testing program at the source IP: simulate the source IP sending a TCP request to the destination IP from an address IP1 in the same network segment, and simulate the source IP sending a TCP request to the destination IP from an address IP2 in a different network segment. A TCP traceroute dial testing task may also be issued to the dial testing program at the source IP: the source IP performs TCP traceroute detection on the destination port of the destination IP.
Assume that the task execution result of the TCP dial testing task is: IP1 cannot communicate with the destination port of the destination IP, and IP2 cannot communicate with the destination IP.
Assume that the task execution result of the TCP traceroute dial testing task is that the network nodes traversed from the source IP to the destination port of the destination IP are returned as IP3, IP4, IP5, IP6, IP7, followed by seven hops with no response. A hop with no response means that the probed network node did not reply; the seven unanswered hops after IP7 mean that none of the network nodes after IP7 replied, which indicates that the forwarding policy after IP7 needs to be examined.
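Locating the last node that answered the traceroute probe can be sketched as follows. The sketch assumes that unanswered hops are recorded with the conventional traceroute placeholder `*`; the constant `NO_RESPONSE` and the function `last_responding_hop` are hypothetical names.

```python
from typing import List, Optional, Tuple

NO_RESPONSE = "*"  # assumed placeholder for a hop that did not reply


def last_responding_hop(hops: List[str]) -> Optional[Tuple[int, str]]:
    """Return (index, address) of the last hop that replied, or None if none did."""
    for index in range(len(hops) - 1, -1, -1):
        if hops[index] != NO_RESPONSE:
            return index, hops[index]
    return None


# The path from the example: five answering nodes followed by seven silent hops.
hops = ["IP3", "IP4", "IP5", "IP6", "IP7"] + [NO_RESPONSE] * 7
suspect = last_responding_hop(hops)
```

Here `suspect` identifies IP7, so the forwarding policy at IP7 and at its next hop is where the examination would start.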
The analysis device collects the traffic from IP1 to the destination IP and from IP2 to the destination IP, and identifies it as dial testing traffic by its TOS value. Statistical analysis yields the dial testing flow index: the number of inflow packets from IP1 to the destination IP is equal to 0, and the number of inflow packets from IP2 to the destination IP is greater than 0.
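Separating dial testing traffic from service traffic by TOS value, and counting inflow packets per simulated source, can be sketched as follows. The marker value `DIAL_TEST_TOS`, the `Packet` record, and the function name are assumptions; the application does not state which TOS value is used.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List

DIAL_TEST_TOS = 0x20  # hypothetical marker value; the text does not specify one


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str
    tos: int


def dial_test_inflow_counts(
    packets: Iterable[Packet], destination: str, sources: List[str]
) -> Dict[str, int]:
    """Count dial testing packets seen at the collection point, per simulated source."""
    counts = {src: 0 for src in sources}
    for packet in packets:
        is_dial_test = packet.tos == DIAL_TEST_TOS
        if is_dial_test and packet.dst == destination and packet.src in counts:
            counts[packet.src] += 1
    return counts


captured = [
    Packet("IP2", "DST", DIAL_TEST_TOS),  # dial testing packet from IP2
    Packet("IP2", "DST", 0),              # ordinary service packet, ignored
    Packet("IP9", "DST", DIAL_TEST_TOS),  # not one of the simulated sources
]
index = dial_test_inflow_counts(captured, "DST", ["IP1", "IP2"])
```

With this sample capture, the count for IP1 is 0 and the count for IP2 is greater than 0, matching the dial testing flow index in the example.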
In S506, the service flow index, the dial testing task execution result, and the dial testing flow index may be matched against the knowledge base to obtain the conclusion: the fault affects access from the source IP network segment to the destination IP.
Root cause analysis: the egress forwarding policy of the node (IP7), or the ingress forwarding policy of its next-hop node, restricts access from the source IP network segment to the destination IP.
Troubleshooting suggestion: check the forwarding policy configuration of the node (IP7) and of its next-hop node.
Based on the diagnosis result obtained in S506, in S507 a TCP dial testing task may be issued to the dial testing program at the source IP: simulate the source IP sending a TCP request to the destination IP from an address IP10 in the same network segment, and simulate the source IP sending a TCP request to the destination IP from an address IP20 in a different network segment.
The analysis device collects the traffic from IP10 to the destination IP and from IP20 to the destination IP, and identifies it as dial testing traffic by its TOS value. Statistical analysis yields the dial testing flow index: the number of inflow packets from IP10 to the destination IP is equal to 0, and the number of inflow packets from IP20 to the destination IP is greater than 0.
After the verification passes, the troubleshooting conclusion is output.
In the scheme shown in the above example, progressive troubleshooting logic may be adopted to form nested troubleshooting: the analysis result of each step shapes the subsequent troubleshooting logic and yields the possible causes of the fault; dial testing tasks corresponding to those possible causes are then constructed to obtain the corresponding dial testing execution results and dial testing flow indexes; finally, these are combined with the service flow index and input into the experience-based troubleshooting knowledge base to obtain the final fault diagnosis result.
Compared with the related art, the scheme provides at least the following technical effects:
First, reliance on highly skilled personnel is reduced. Automatic detection, localization, and diagnosis of network faults can at least partly replace manual operation and reduce the dependence on highly skilled network engineers, which in turn lowers the working pressure on technicians.
Second, fault handling efficiency is improved. The automated system can analyze large amounts of network data quickly and accurately, locate the root cause of a problem, and complete fault diagnosis more efficiently than manual operation. Network service can be restored quickly, which shortens the service interruption caused by faults and reduces enterprise losses.
Third, historical experience is solidified. The troubleshooting knowledge base is built from a large number of historical cases and knowledge summaries, can be reused in problem diagnosis, and can be continuously enriched and optimized. Compared with manual operation, this solidifies, accumulates, and shares experience and avoids knowledge loss.
Fourth, the scheme follows real troubleshooting logic more closely, which makes troubleshooting more accurate.
Corresponding to the method embodiments of the application, the application also provides corresponding network fault diagnosis system embodiments.
FIG. 7 is a structural diagram of a network fault diagnosis system shown in the present application. As shown in FIG. 7, the network fault diagnosis system 700 includes:
a first obtaining unit 710 for obtaining a first traffic index of a first network line corresponding to a network fault to be diagnosed; the first service flow index is used for indicating the flow transmission condition in the first network line;
a first analysis unit 720, configured to analyze, according to the first traffic flow index, a failure area and a failure time point of the network failure;
a second obtaining unit 730, configured to obtain a change condition of a second traffic flow index of at least one preset second network line and/or an overall network before and after the fault time point; the second network line is associated with the first network line; the whole network is a network to which the first network line belongs; the second traffic flow index indicates a flow transmission condition in the second network line or the whole network;
a second analysis unit 740, configured to analyze the fault influence range and the possible fault causes of the network fault to be diagnosed according to the change condition;
a dial testing unit 750, configured to select, according to the fault area, the fault influence range, and the possible fault causes, at least one matched first dial testing task from preset dial testing tasks for execution, and to obtain a first execution result and a first dial testing flow index of each first dial testing task; the first dial testing flow index is used for indicating the flow transmission condition of the first dial testing task;
and a matching unit 760, configured to query a pre-maintained troubleshooting knowledge base according to the first service flow index, the first execution result, and the first dial testing flow index, so as to obtain a matched first fault diagnosis result.
In some embodiments, the first service flow index includes a first inflow packet number and a first outflow packet number, with the destination end of the first network line as the analysis object;
the first analysis unit 720 is further configured to:
determine, when the first inflow packet number and the first outflow packet number are both equal to a first preset value, that the fault area of the network fault is between the source end of the first network line and a flow acquisition point, the flow acquisition point being located between the source end and the destination end;
determine, when the first inflow packet number is greater than the first preset value and the first outflow packet number is equal to the first preset value, that the fault area of the network fault is between the flow acquisition point and the destination end;
and determine, when the first inflow packet number and the first outflow packet number are both greater than the first preset value, that the fault area of the network fault is between the flow acquisition point and the source end.
In some embodiments, the second network line comprises at least one of:
the source end to other ports of the destination end;
the source end to the network segment of the destination end;
the source end to destination ends other than the destination end;
and source ends other than the source end to the destination end.
In some embodiments, when the first inflow packet number and the first outflow packet number are both equal to the first preset value, the second service flow index is the number of inflow packets;
when the first inflow packet number is greater than the first preset value and the first outflow packet number is equal to the first preset value, the second service flow index is the number of outflow packets;
and when the first inflow packet number and the first outflow packet number are both greater than the first preset value, the second service flow index is the number of new sessions or active sessions.
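The choice of the second service flow index in this embodiment can be sketched as a small selection function. The function name, the string labels, and the default preset value of 0 (taken from the earlier example, where the counts are compared with 0) are assumptions for illustration.

```python
def select_second_index(first_inflow: int, first_outflow: int, preset: int = 0) -> str:
    """Pick which second service flow index to query, based on the first index."""
    if first_inflow == preset and first_outflow == preset:
        return "inflow packets"
    if first_inflow > preset and first_outflow == preset:
        return "outflow packets"
    if first_inflow > preset and first_outflow > preset:
        return "new or active sessions"
    # Other combinations are not described in the embodiment.
    raise ValueError("combination not covered by the described rules")
```

Keeping the preset value as a parameter reflects that the embodiment speaks of a "first preset value" rather than fixing it to 0.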
In some embodiments, the dial testing unit 750 is further configured to:
determine the dial testing task type of the first dial testing task to be executed according to the fault influence range and the possible fault causes;
determine an executor for each type of first dial testing task according to the fault area;
and send each first dial testing task to the corresponding executor for execution.
In some embodiments, the troubleshooting knowledge base includes the following nodes, maintained based on historical experience and connected to one another: service flow index nodes, dial testing task execution result nodes, dial testing flow index nodes, conclusion nodes, root cause nodes, and troubleshooting suggestion nodes;
the matching unit 760 is further configured to:
query, in the troubleshooting knowledge base, a target service flow index node matching the first service flow index, a target dial testing task execution result node corresponding to the first execution result, and a target dial testing flow index node corresponding to the first dial testing flow index;
determine, among the conclusion nodes connected with the target service flow index node and the target dial testing task execution result node, the target conclusion node with the largest sum of connection weights;
determine, among the root cause nodes connected with the target service flow index node and the target dial testing task execution result node, the target root cause node with the largest sum of connection weights;
determine, among the troubleshooting suggestion nodes connected with the target root cause node, the target troubleshooting suggestion node with the largest connection weight;
and output the first fault diagnosis result according to the target conclusion node, the target root cause node, and the target troubleshooting suggestion node.
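The weighted lookup described above can be sketched as follows. This is a minimal illustration, assuming the knowledge base is stored as a dictionary of weighted edges keyed by a pair of node names; the node names, weights, and the functions `best_connected` and `diagnose` are hypothetical and are not taken from the application.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Weights = Dict[Tuple[str, str], float]  # (from_node, to_node) -> connection weight


def best_connected(candidates: Iterable[str], evidence: List[str], weights: Weights) -> Optional[str]:
    """Return the candidate with the largest sum of connection weights to the evidence nodes."""
    best, best_score = None, float("-inf")
    for candidate in candidates:
        edges = [weights[(e, candidate)] for e in evidence if (e, candidate) in weights]
        if not edges:
            continue  # not connected to any evidence node
        score = sum(edges)
        if score > best_score:
            best, best_score = candidate, score
    return best


def diagnose(evidence, conclusions, root_causes, suggestions, weights):
    """Select the target conclusion, root cause, and troubleshooting suggestion nodes."""
    conclusion = best_connected(conclusions, evidence, weights)
    root_cause = best_connected(root_causes, evidence, weights)
    suggestion = best_connected(suggestions, [root_cause], weights) if root_cause else None
    return conclusion, root_cause, suggestion


weights: Weights = {
    ("svc:in0_out0", "c:segment_blocked"): 0.6,
    ("task:ip1_unreachable", "c:segment_blocked"): 0.7,
    ("svc:in0_out0", "c:host_down"): 0.9,
    ("svc:in0_out0", "r:forwarding_policy"): 0.8,
    ("r:forwarding_policy", "s:check_policy"): 0.9,
    ("r:forwarding_policy", "s:check_route"): 0.4,
}
result = diagnose(
    ["svc:in0_out0", "task:ip1_unreachable"],
    ["c:segment_blocked", "c:host_down"],
    ["r:forwarding_policy"],
    ["s:check_policy", "s:check_route"],
    weights,
)
```

In this sample, `c:host_down` has the single heaviest edge, but `c:segment_blocked` is selected because the sum over both evidence nodes is larger, which is the point of ranking by the sum of connection weights.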
In some embodiments, the system 700 further comprises a verification unit; the verification unit is configured to:
after the first fault diagnosis result is obtained, select, according to the first fault diagnosis result, at least one matched second dial testing task from the preset dial testing tasks for execution;
acquire a second execution result and a second dial testing flow index of each second dial testing task;
query the troubleshooting knowledge base according to the second execution result and the second dial testing flow index to obtain a matched second fault diagnosis result;
output the first fault diagnosis result when the second execution result and the second dial testing flow index match the first fault diagnosis result;
and perform fault diagnosis again when the second execution result does not match the first fault diagnosis result.
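The verification flow of this embodiment can be sketched as a loop. Several points are assumptions rather than statements of the application: "matching" is modelled as equality between the second and first diagnosis results, the retry limit `max_rounds` is added so the sketch terminates, and the three callables are hypothetical stand-ins for the diagnosis, dial testing, and knowledge base query steps.

```python
from typing import Callable, Optional, Tuple


def diagnose_with_verification(
    diagnose: Callable[[], str],
    run_second_dial_tests: Callable[[str], Tuple[dict, dict]],
    query_knowledge_base: Callable[[dict, dict], str],
    max_rounds: int = 3,
) -> Optional[str]:
    """Return a verified diagnosis, or None if no round passes verification."""
    for _ in range(max_rounds):
        first_diagnosis = diagnose()
        # Second dial testing tasks are chosen according to the first diagnosis.
        second_result, second_index = run_second_dial_tests(first_diagnosis)
        second_diagnosis = query_knowledge_base(second_result, second_index)
        if second_diagnosis == first_diagnosis:
            return first_diagnosis  # verification passed: output the first result
        # Otherwise fall through and diagnose again.
    return None
```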
In some embodiments, the system 700 further comprises a first update module; the first update module is configured to:
when the second execution result and the second dial testing flow index match the first fault diagnosis result, increase the weights between the target conclusion node and each of the target service flow index node, the target dial testing task execution result node, and the target dial testing flow index node;
and increase the weights between the target root cause node and each of the target service flow index node, the target dial testing task execution result node, and the target dial testing flow index node.
In some embodiments, the system 700 further comprises a second update module; the second update module is configured to:
after the first fault diagnosis result is output according to the target conclusion node, the target root cause node, and the target troubleshooting suggestion node, receive feedback information on the troubleshooting suggestion in the first fault diagnosis result;
when the feedback information is positive, increase the weight between the target root cause node and the target troubleshooting suggestion node;
and when the feedback information is negative, reduce the weight between the target root cause node and the target troubleshooting suggestion node, and/or determine, among the troubleshooting suggestion nodes connected with the target root cause node, a first troubleshooting suggestion node whose weight is second only to that of the target troubleshooting suggestion node, and output a troubleshooting suggestion according to the first troubleshooting suggestion node.
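The feedback handling of this embodiment can be sketched as follows. The fixed adjustment `step`, the edge dictionary layout, and the function name `apply_feedback` are assumptions; the application states only that the weight is increased or reduced, not by how much.

```python
from typing import Dict, Optional, Tuple

Weights = Dict[Tuple[str, str], float]  # (root_cause, suggestion) -> connection weight


def apply_feedback(
    weights: Weights, root_cause: str, suggestion: str, positive: bool, step: float = 0.1
) -> Optional[str]:
    """Adjust the root-cause-to-suggestion weight and return the suggestion to output next."""
    key = (root_cause, suggestion)
    if positive:
        weights[key] += step
        return suggestion
    # Rank the suggestions of this root cause before lowering the weight,
    # so the fallback is the one ranked directly below the rejected suggestion.
    ranked = sorted(
        (s for (r, s) in weights if r == root_cause),
        key=lambda s: weights[(root_cause, s)],
        reverse=True,
    )
    weights[key] -= step
    position = ranked.index(suggestion)
    return ranked[position + 1] if position + 1 < len(ranked) else None


kb: Weights = {("r", "s1"): 0.9, ("r", "s2"): 0.4, ("r", "s3"): 0.2}
fallback = apply_feedback(kb, "r", "s1", positive=False)
```

In this sample, negative feedback on `s1` lowers its weight and returns `s2`, the suggestion with the next-highest weight for the same root cause.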
In the scheme illustrated by the system embodiment, progressive troubleshooting logic may be adopted to form nested troubleshooting: the analysis result of each step shapes the subsequent troubleshooting logic and yields the possible causes of the fault; dial testing tasks corresponding to those possible causes are then constructed to obtain the corresponding dial testing execution results and dial testing flow indexes; finally, these are combined with the service flow index and input into the experience-based troubleshooting knowledge base to obtain the final fault diagnosis result.
Compared with the related art, the system embodiment provides at least the same technical effects as the method embodiment: reliance on highly skilled personnel is reduced, fault handling efficiency is improved, historical experience is solidified in the troubleshooting knowledge base, and the scheme follows real troubleshooting logic more closely, which makes troubleshooting more accurate.
One skilled in the relevant art will recognize that one or more embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, one or more embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Moreover, one or more embodiments of the present application may take the form of a computer program product on one or more computer-usable storage media (which may include, but are not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
"And/or" in this application means at least one of the two. The embodiments in this application are described in a progressive manner; identical and similar parts of the embodiments may be cross-referenced, and each embodiment focuses on its differences from the others. In particular, the description of the data processing apparatus embodiments is relatively brief because they are substantially similar to the method embodiments; for the relevant details, refer to the description of the method embodiments.
Although this application contains many specific implementation details, these should not be construed as limiting the scope of any disclosure or the scope of what is claimed, but rather as primarily describing features of certain disclosed embodiments. Certain features that are described in this application in the context of separate embodiments can also be implemented in combination in a single embodiment. On the other hand, the various features described in the individual embodiments may also be implemented separately in the various embodiments or in any suitable subcombination. Furthermore, although features may be acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, although operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
The foregoing description of the preferred embodiment(s) of the present application is merely intended to illustrate the embodiment(s) of the present application and is not intended to limit the embodiment(s) of the present application, since any and all modifications, equivalents, improvements, etc. that fall within the spirit and principles of the embodiment(s) of the present application are intended to be included within the scope of the present application.

Claims (9)

1. A method of diagnosing a network failure, the method comprising:
acquiring a first service flow index of a first network line corresponding to a network fault to be diagnosed; the first service flow index is used for indicating the flow transmission condition in the first network line;
analyzing a fault area and a fault time point of the network fault according to the first service flow index;
acquiring the change condition of a second service flow index of at least one preset second network line and/or an overall network before and after the fault time point; the second network line has one of the following that is the same as, or associated with, that of the first network line: a source end and a destination end; the overall network is the network to which the first network line belongs; the second service flow index indicates the flow transmission condition in the second network line or the overall network;
analyzing the fault influence range and possible fault reasons of the network fault to be diagnosed according to the change condition;
determining a dial testing task type of a first dial testing task to be executed according to the fault influence range and the possible fault cause, determining executors of each type of the first dial testing task according to the fault area, sending each first dial testing task to a corresponding executor for execution, and obtaining a first execution result and a first dial testing flow index of each first dial testing task; the first dial testing flow index is used for indicating the flow transmission condition of the first dial testing task;
and querying a pre-maintained troubleshooting knowledge base according to the first service flow index, the first execution result, and the first dial testing flow index to obtain a matched first fault diagnosis result.
2. The network failure diagnosis method according to claim 1, wherein the first traffic flow index includes a first inflow packet number and a first outflow packet number with a destination end in the first network line as an analysis object;
the analyzing the fault area of the network fault according to the first traffic flow index includes:
when the first inflow packet number and the first outflow packet number are both equal to a first preset value, determining that the fault area of the network fault is between the source end of the first network line and a flow acquisition point; the flow acquisition point is located between the source end and the destination end;
when the first inflow packet number is greater than the first preset value and the first outflow packet number is equal to the first preset value, determining that the fault area of the network fault is between the flow acquisition point and the destination end;
and when the first inflow packet number and the first outflow packet number are both greater than the first preset value, determining that the fault area of the network fault is between the flow acquisition point and the source end.
3. The network fault diagnosis method according to claim 2, wherein the second network line includes at least one of:
the source end to other ports of the destination end;
the source end to the network segment of the destination end;
the source end to destination ends other than the destination end;
and source ends other than the source end to the destination end.
4. The network fault diagnosis method according to claim 3, wherein when the first inflow packet number and the first outflow packet number are both equal to the first preset value, the second service flow index is the number of inflow packets;
when the first inflow packet number is greater than the first preset value and the first outflow packet number is equal to the first preset value, the second service flow index is the number of outflow packets;
and when the first inflow packet number and the first outflow packet number are both greater than the first preset value, the second service flow index is the number of new sessions or active sessions.
5. The network fault diagnosis method according to claim 1, wherein the troubleshooting knowledge base comprises the following nodes, maintained based on historical experience and connected to one another: service flow index nodes, dial testing task execution result nodes, dial testing flow index nodes, conclusion nodes, root cause nodes, and troubleshooting suggestion nodes;
the querying a pre-maintained troubleshooting knowledge base according to the first service flow index, the first execution result, and the first dial testing flow index to obtain a matched first fault diagnosis result comprises:
querying, in the troubleshooting knowledge base, a target service flow index node matching the first service flow index, a target dial testing task execution result node corresponding to the first execution result, and a target dial testing flow index node corresponding to the first dial testing flow index;
determining, among the conclusion nodes connected with the target service flow index node and the target dial testing task execution result node, the target conclusion node with the largest sum of connection weights;
determining, among the root cause nodes connected with the target service flow index node and the target dial testing task execution result node, the target root cause node with the largest sum of connection weights;
determining, among the troubleshooting suggestion nodes connected with the target root cause node, the target troubleshooting suggestion node with the largest connection weight;
and outputting the first fault diagnosis result according to the target conclusion node, the target root cause node, and the target troubleshooting suggestion node.
6. The network fault diagnosis method according to claim 5, wherein after obtaining the first fault diagnosis result, the method further comprises:
according to the first fault diagnosis result, selecting at least one matched second dial testing task from preset dial testing tasks to execute;
acquiring a second execution result and a second dial testing flow index of each second dial testing task;
querying the troubleshooting knowledge base according to the second execution result and the second dial testing flow index to obtain a matched second fault diagnosis result;
outputting the first fault diagnosis result under the condition that the second execution result and the second dial testing flow index are matched with the first fault diagnosis result;
and carrying out fault diagnosis again under the condition that the second execution result is not matched with the first fault diagnosis result.
7. The network fault diagnosis method according to claim 6, wherein when the second execution result and the second dial testing flow index match the first fault diagnosis result, the method further comprises:
increasing the weights between the target conclusion node and each of the target service flow index node, the target dial testing task execution result node, and the target dial testing flow index node;
and increasing the weights between the target root cause node and each of the target service flow index node, the target dial testing task execution result node, and the target dial testing flow index node.
8. The network fault diagnosis method according to claim 5, wherein after the first fault diagnosis result is output according to the target conclusion node, the target root cause node, and the target troubleshooting suggestion node, the method further comprises:
receiving feedback information on the troubleshooting suggestion in the first fault diagnosis result;
when the feedback information is positive, increasing the weight between the target root cause node and the target troubleshooting suggestion node;
and when the feedback information is negative, reducing the weight between the target root cause node and the target troubleshooting suggestion node, and/or determining, among the troubleshooting suggestion nodes connected with the target root cause node, a first troubleshooting suggestion node whose weight is second only to that of the target troubleshooting suggestion node, and outputting a troubleshooting suggestion according to the first troubleshooting suggestion node.
9. A network fault diagnosis system, the system comprising:
the first acquisition unit acquires a first service flow index of a first network line corresponding to a network fault to be diagnosed; the first service flow index is used for indicating the flow transmission condition in the first network line;
the first analysis unit is used for analyzing the fault area and the fault time point of the network fault according to the first service flow index;
the second acquisition unit acquires the change condition of a second service flow index of at least one preset second network line and/or an overall network before and after the fault time point; the second network line has one of the following that is the same as, or associated with, that of the first network line: a source end and a destination end; the overall network is the network to which the first network line belongs; the second service flow index indicates the flow transmission condition in the second network line or the overall network;
the second analysis unit is used for analyzing the fault influence range and possible fault reasons of the network fault to be diagnosed according to the change condition;
the dial testing unit is used for determining dial testing task types of first dial testing tasks to be executed according to the fault influence range and the possible fault reasons, determining executors of each type of the first dial testing tasks according to the fault areas, sending each first dial testing task to a corresponding executor for execution, and obtaining a first execution result and a first dial testing flow index of each first dial testing task; the first dial testing flow index is used for indicating the flow transmission condition of the first dial testing task;
and the matching unit is used for querying a pre-maintained troubleshooting knowledge base according to the first service flow index, the first execution result, and the first dial testing flow index to obtain a matched first fault diagnosis result.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311206984.4A CN116938684B (en) 2023-09-19 2023-09-19 Network fault diagnosis method and system

Publications (2)

Publication Number Publication Date
CN116938684A CN116938684A (en) 2023-10-24
CN116938684B CN116938684B (en) 2023-12-26

Family

ID=88390147

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311206984.4A Active CN116938684B (en) 2023-09-19 2023-09-19 Network fault diagnosis method and system

Country Status (1)

Country Link
CN (1) CN116938684B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103650569A (en) * 2013-07-22 2014-03-19 华为技术有限公司 Fault diagnosis method and device of wireless network
CN107171819A (en) * 2016-03-07 2017-09-15 北京华为数字技术有限公司 A kind of network fault diagnosis method and device
WO2020227985A1 (en) * 2019-05-15 2020-11-19 Alibaba Group Holding Limited Real-time fault detection on network devices and circuits based on traffic volume statistics
CN113810238A (en) * 2020-06-12 2021-12-17 中兴通讯股份有限公司 Network monitoring method, electronic device and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7480735B2 (en) * 2003-09-11 2009-01-20 Sun Microsystems, Inc. System and method for routing network traffic through weighted zones

Also Published As

Publication number Publication date
CN116938684A (en) 2023-10-24

Similar Documents

Publication Publication Date Title
US7636318B2 (en) Real-time network analyzer
CN112564964B (en) Fault link detection and recovery method based on software defined network
CN102868553B (en) Fault Locating Method and relevant device
EP2081321A2 (en) Sampling apparatus distinguishing a failure in a network even by using a single sampling and a method therefor
CN114157554B (en) Fault checking method and device, storage medium and computer equipment
WO2006028808A2 (en) Method and apparatus for assessing performance and health of an information processing network
CN109039763A (en) A kind of network failure nodal test method and Network Management System based on backtracking method
US20140355453A1 (en) Method and arrangement for fault analysis in a multi-layer network
CN113708995B (en) Network fault diagnosis method, system, electronic equipment and storage medium
CN113938407A (en) Data center network fault detection method and device based on in-band network telemetry system
CN111611146A (en) Micro-service fault prediction method and device
JP2008283621A (en) Apparatus and method for monitoring network congestion state, and program
CN113572656A (en) Method and device for flexibly combining inspection items of network equipment
CN111200544A (en) Network port flow testing method and device
CN111082979A (en) Intelligent substation process layer secondary circuit fault diagnosis method based on switch and fault diagnosis host
CN116938684B (en) Network fault diagnosis method and system
CN110609761B (en) Method and device for determining fault source, storage medium and electronic equipment
CN114553678B (en) Cloud network soft SLB flow problem diagnosis method
CN116466166A (en) Fault detection method, system and device for large-screen equipment
KR101074064B1 (en) Network traffic monitoring method and apparatus
CN114172796A (en) Fault positioning method and related device for communication network
CN108390790B (en) Fault diagnosis method and device for routing equipment
CN109088765B (en) Interconnection network routing fault diagnosis method and device
Lad et al. Inferring the origin of routing changes using link weights
CN112636944B (en) OLT equipment offline intelligent diagnosis method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant