CN113132140A

CN113132140A - Network fault detection method, device, equipment and storage medium

Info

Publication number: CN113132140A
Application number: CN201911416736.6A
Authority: CN
Inventors: 曹紫莹; 李诗逸; 古亮
Original assignee: Sangfor Technologies Co Ltd
Current assignee: Sangfor Technologies Co Ltd
Priority date: 2019-12-31
Filing date: 2019-12-31
Publication date: 2021-07-16
Anticipated expiration: 2039-12-31
Also published as: CN113132140B

Abstract

The embodiment of the application discloses a method, a device, equipment and a storage medium for detecting network faults, wherein the method comprises the following steps: acquiring a network topology structure of a computer network; determining at least one detection path from a plurality of paths of the network topology; initiating packet loss detection on the at least one detection path based on a packet loss detection strategy, and determining at least one packet loss path with a packet loss fault in the at least one detection path; and positioning a fault point based on the at least one detection path and the at least one packet loss path, and determining a first type of fault node with a complete packet loss fault and a second type of fault node with a partial packet loss fault on the at least one packet loss path. Therefore, after the packet loss path is determined, the fault point in the packet loss path is further determined to be accurately positioned, and the subsequent network maintenance efficiency is improved.

Description

Network fault detection method, device, equipment and storage medium

Technical Field

The present application relates to computer technologies, and in particular, to a method, an apparatus, a device, and a storage medium for detecting a network fault.

Background

After a computer network fails, the stability, reliability and performance of the network are affected to some degree, and the consequences of large-scale network paralysis are hard to imagine. In addition, network problems are diverse, have complex causal relationships, occur instantaneously and arise in scenarios that cannot be reproduced, so in many cases the problem is hard to discover and troubleshoot, and the fault point is hard to find.

Packet loss is a common fault in computer networks. Many existing service systems include some simple packet loss detection mechanisms, but they still cannot accurately locate the fault point. Existing detection tools lack the ability to locate the fault point accurately and must rely on auxiliary tools for positioning; even if a network problem is detected and successfully located with an auxiliary tool, the fault has often disappeared by then, because such problems may exist for only seconds.

Disclosure of Invention

In order to solve the foregoing technical problems, embodiments of the present application aim to provide a method, an apparatus, a device, and a storage medium for detecting a network fault.

The technical scheme of the application is realized as follows:

in a first aspect, a method for detecting a network fault is provided, where the method includes:

acquiring a network topology structure of a computer network; the network topology structure comprises a plurality of paths composed of different network nodes;

determining at least one detection path from a plurality of paths of the network topology;

initiating packet loss detection on the at least one detection path based on a packet loss detection strategy, and determining at least one packet loss path with a packet loss fault in the at least one detection path;

and positioning a fault point based on the at least one detection path and the at least one packet loss path, and determining a first type of fault node with a complete packet loss fault and a second type of fault node with a partial packet loss fault on the at least one packet loss path.

In the foregoing solution, the performing fault point location based on the at least one detection path and the at least one packet loss path includes: acquiring a first path mapping relation and a first node mapping relation of the at least one detection path; the path mapping relation comprises a mapping relation from a path to a network node, and the node mapping relation comprises a mapping relation from a network node to a path; and determining the first type of fault node and the second type of fault node based on the at least one packet loss path, the first path mapping relation of the at least one detection path and the first node mapping relation.

In the foregoing solution, the determining the first type of fault node and the second type of fault node based on the at least one packet loss path, the first path mapping relationship of the at least one detection path, and the first node mapping relationship includes: determining network nodes corresponding to the at least one packet loss path based on the at least one packet loss path and the first path mapping relationship of the at least one detection path, and forming a packet loss suspected set from the network nodes corresponding to the at least one packet loss path; determining the first type of fault node with a complete packet loss fault based on the packet loss suspected set and the first node mapping relationship of the at least one detection path; and obtaining the second type of fault node after the first type of fault node is eliminated from the packet loss suspected set.

In the foregoing solution, the determining, based on the packet loss suspected set and the first node mapping relationship of the at least one detection path, the first type of faulty node having a complete packet loss fault includes: determining a path corresponding to each suspected node in the packet loss suspected set based on the packet loss suspected set and the first node mapping relation of the at least one detection path; and determining suspected nodes of which the paths are all packet loss paths to be first-class fault nodes based on the at least one packet loss path and the paths corresponding to the suspected nodes.

In the foregoing solution, the obtaining a network topology of a computer network includes: acquiring a second path mapping relation and a second node mapping relation of a plurality of paths in the network topology structure;

correspondingly, the determining at least one detection path from the plurality of paths of the network topology includes: based on a minimized path algorithm, simplifying a second path mapping relation and a second node mapping relation of a plurality of paths in the network topology structure, and determining the at least one detection path, and a first path mapping relation and a first node mapping relation of the at least one detection path.

In the foregoing solution, the initiating packet loss detection on the at least one detection path based on the packet loss detection policy includes: constructing a first probe for path detection; if the detection condition is met, sending the first probe to a target detection path in the at least one detection path; receiving a second probe returned by the target detection path in response to the first probe; if the sending number of the first probes is not equal to the receiving number of the second probes, determining that the target detection path is a packet loss path; and if the sending number of the first probes is equal to the receiving number of the second probes, determining that the target detection path is a normal path.

In a second aspect, an apparatus for detecting network failure is provided, the apparatus comprising:

an acquisition unit for acquiring a network topology of a computer network; the network topology structure comprises a plurality of paths composed of different network nodes;

a processing unit for determining at least one detection path from a plurality of paths of the network topology;

a detecting unit, configured to initiate packet loss detection on the at least one detection path based on a packet loss detection policy, and determine at least one packet loss path in which a packet loss fault exists in the at least one detection path;

and the fault positioning unit is used for positioning a fault point based on the at least one detection path and the at least one packet loss path, and determining a first type of fault node with a complete packet loss fault and a second type of fault node with a partial packet loss fault on the at least one packet loss path.

In the foregoing solution, the fault location unit is specifically configured to obtain a first path mapping relationship and a first node mapping relationship of the at least one detection path; the path mapping relation comprises a mapping relation from a path to a network node, and the node mapping relation comprises a mapping relation from a network node to a path.

In the foregoing solution, the fault location unit is specifically configured to determine, based on the at least one packet loss path and the first path mapping relationship of the at least one detection path, the network nodes corresponding to the at least one packet loss path, and form a packet loss suspected set from the network nodes corresponding to the at least one packet loss path; determine the first type of fault node with a complete packet loss fault based on the packet loss suspected set and the first node mapping relationship of the at least one detection path; and obtain the second type of fault node after the first type of fault node is eliminated from the packet loss suspected set.

In the foregoing scheme, the fault location unit is specifically configured to determine, based on the packet loss suspected set and the first node mapping relationship of the at least one detection path, a path corresponding to each suspected node in the packet loss suspected set; and determining suspected nodes of which the paths are all packet loss paths to be first-class fault nodes based on the at least one packet loss path and the paths corresponding to the suspected nodes.

In the foregoing scheme, the obtaining unit is specifically configured to obtain a second path mapping relationship and a second node mapping relationship of multiple paths in the network topology;

correspondingly, the processing unit is specifically configured to reduce, based on a minimized path algorithm, a second path mapping relationship and a second node mapping relationship of multiple paths in the network topology, and determine the at least one detection path, and a first path mapping relationship and a first node mapping relationship of the at least one detection path.

In the above solution, the detection unit is specifically configured to construct a first probe for path detection; if the detection condition is met, sending the first probe to a target detection path in the at least one detection path; receiving a second probe returned by the target detection path in response to the first probe; if the sending number of the first probes is not equal to the receiving number of the second probes, determining that the target path is a packet loss path; and if the sending number of the first probes is equal to the receiving number of the second probes, determining that the target path is a normal path.

In a third aspect, a network device is provided, including: a processor and a memory configured to store a computer program operable on the processor, wherein the processor is configured to perform the steps of the aforementioned method when executing the computer program.

In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the aforementioned method.

By adopting the technical scheme, the network topology structure of the computer network is obtained; determining at least one detection path from a plurality of paths of the network topology; initiating packet loss detection on the at least one detection path based on a packet loss detection strategy, and determining at least one packet loss path with a packet loss fault in the at least one detection path; and positioning a fault point based on the at least one detection path and the at least one packet loss path, and determining a first type of fault node with a complete packet loss fault and a second type of fault node with a partial packet loss fault on the at least one packet loss path. Therefore, after the packet loss path is determined, the fault point in the packet loss path is further determined to be accurately positioned, and the subsequent network maintenance efficiency is improved.

Drawings

Fig. 1 is a first schematic flowchart of the network fault detection method in an embodiment of the present application;

Fig. 2 is a schematic flowchart of packet loss path detection in an embodiment of the present application;

Fig. 3 is a second schematic flowchart of the network fault detection method in an embodiment of the present application;

Fig. 4 is a third schematic flowchart of the network fault detection method in an embodiment of the present application;

Fig. 5 is a fourth schematic flowchart of the network fault detection method in an embodiment of the present application;

Fig. 6 is a first schematic flowchart of fault point detection in an embodiment of the present application;

Fig. 7 is a second schematic flowchart of fault point detection in an embodiment of the present application;

Fig. 8 is a schematic structural diagram of a network fault detection apparatus in an embodiment of the present application;

Fig. 9 is a schematic structural diagram of a network device in an embodiment of the present application.

Detailed Description

So that the manner in which the features and elements of the present embodiments can be understood in detail, a more particular description of the embodiments, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings.

Example one

An embodiment of the present application provides a network fault detection method, fig. 1 is a first flowchart of the network fault detection method in the embodiment of the present application, and as shown in fig. 1, the method may specifically include:

step 101: acquiring a network topology structure of a computer network; the network topology structure comprises a plurality of paths composed of different network nodes;

step 102: determining at least one detection path from a plurality of paths of the network topology;

step 103: initiating packet loss detection on the at least one detection path based on a packet loss detection strategy, and determining at least one packet loss path with a packet loss fault in the at least one detection path;

step 104: and positioning a fault point based on the at least one detection path and the at least one packet loss path, and determining a first type of fault node with a complete packet loss fault and a second type of fault node with a partial packet loss fault on the at least one packet loss path.

Here, the subject of performing the path detection of steps 101 to 104 may be a processor of the network node. The computer network includes network nodes such as hosts, switches, and routers, and the main body for performing path detection may be a host in the computer network.

The network topology refers to a physical configuration mode of points and lines formed by network nodes and transmission media in a computer network. There are two types of network nodes: one is a switching node for switching and exchanging information, which comprises a switch, a hub, a terminal controller and the like; the other type is an access node which comprises a computer host, a terminal and the like. Lines represent various transmission media, both tangible and intangible.

Each network topology is composed of network nodes, links, and paths.

1. Network node: also referred to as a network element; these are the various data processing devices, data communication control devices, and data terminal devices in the network system. Common nodes are servers, workstations, concentrators and switches.

2. Link: the connection between two network nodes. Links can be divided into physical links and logical links; the former is an actually existing communication line, and the latter is a network path that functions logically.

3. Path: refers to a series of nodes and links from a sending node that sends a message to a receiving node that receives the message, i.e., a series of node-to-node links established across a communication network.
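As a minimal sketch of these three concepts, assume a path is stored as an ordered tuple of node names. The type aliases `Path` and `Link`, the helper `links_of`, and the node labels are hypothetical and do not come from the application. Under that assumption, the links of a path are simply its consecutive node pairs:

```python
from typing import List, Tuple

Path = Tuple[str, ...]   # hypothetical: ordered nodes from sending node to receiving node
Link = Tuple[str, str]   # hypothetical: connection between two adjacent nodes


def links_of(path: Path) -> List[Link]:
    """Return the links traversed by a path, in order."""
    return list(zip(path, path[1:]))


# Illustrative node names only.
example_path: Path = ("host_a", "switch_1", "router_1", "host_b")
example_links = links_of(example_path)
```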

The existing packet loss detection mechanism occupies a large amount of system resources. For example, because many paths in the network topology structure have inclusion relationships, detecting all paths occupies more system resources and puts heavy pressure on network bandwidth. In actual detection, data packets do not need to be repeatedly transmitted and received on the included paths, which reduces the bandwidth pressure of the network and improves the bandwidth utilization rate.

In some embodiments, determining at least one detection path from the plurality of paths of the network topology specifically comprises: filtering out repeated paths in the network topology based on a path selection policy, and determining at least one detection path in the plurality of paths. Repeated detection paths are reduced based on the characteristics of the network topology structure, and the bandwidth pressure of the network can be relieved when network fault detection is carried out.

The network topology specifically includes: a second path mapping relationship and a second node mapping relationship of the plurality of paths; the path mapping relation comprises a mapping relation from a path to a network node, and the node mapping relation comprises a mapping relation from a network node to a path. Here, a path includes at least two network nodes, and the path mapping relationship specifically includes a mapping relationship between a path and at least two network nodes; the node mapping relation specifically comprises the mapping relation between one network node and one or more paths where the network node is located. In the embodiment of the present application, the second path mapping relationship includes mapping relationships between all paths in the network topology map and network nodes, and the second node mapping relationship includes mapping relationships between all nodes in the network topology map and paths. The first path mapping relation only includes a mapping relation between the detection path and the network node, and the first node mapping relation only includes a mapping relation between the network node located on the detection path and the detection path.
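The two mapping relations can be pictured as a pair of dictionaries. The sketch below assumes string identifiers for paths and nodes; the names `path_to_nodes`, `invert_path_mapping`, and the sample topology are illustrative assumptions, not part of the application. It shows how a node mapping can be derived from a path mapping:

```python
from collections import defaultdict
from typing import Dict, List, Set

# Path mapping (hypothetical sample): path id -> ordered network nodes on that path.
path_to_nodes: Dict[str, List[str]] = {
    "p1": ["h1", "s1", "h2"],
    "p2": ["h1", "s1", "s2", "h3"],
}


def invert_path_mapping(path_map: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Build the node mapping (node -> ids of paths containing it) from the path mapping."""
    node_map: Dict[str, Set[str]] = defaultdict(set)
    for path_id, nodes in path_map.items():
        for node in nodes:
            node_map[node].add(path_id)
    return dict(node_map)


node_to_paths = invert_path_mapping(path_to_nodes)
```

Keeping both directions avoids rescanning every path when a later step asks which paths pass through a given node.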

According to the two mapping relations between the paths and the network nodes in the network topology structure, the repeated paths in the network topology structure are determined and filtered to obtain the detection paths, and only the detection paths are subjected to fault detection, so that the bandwidth pressure of the network can be reduced, the occupation of system resources is reduced, and the real-time performance of detection is improved.

Specifically, based on a minimized path algorithm, simplifying a second path mapping relationship and a second node mapping relationship of a plurality of paths in the network topology structure, and determining the at least one detection path, and the first path mapping relationship and the first node mapping relationship of the at least one detection path. Here, the minimized path algorithm may be a minimum heap algorithm.
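The application does not spell out the minimized path algorithm. The following is an illustration only, and it rests on two assumptions: that a path is "included" when its node sequence appears as a contiguous run inside another path, and that a simple longest-first greedy filter is acceptable in place of the minimum heap algorithm mentioned above. The function names are hypothetical. The sketch reduces the second path mapping to a first path mapping and a first node mapping:

```python
from typing import Dict, List, Set, Tuple


def is_included(short: List[str], long: List[str]) -> bool:
    """True if `short` appears as a contiguous run of nodes inside `long` (assumed meaning of inclusion)."""
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))


def select_detection_paths(
    second_path_map: Dict[str, List[str]],
) -> Tuple[Dict[str, List[str]], Dict[str, Set[str]]]:
    """Drop paths included in an already kept path; return (first path map, first node map)."""
    first_path_map: Dict[str, List[str]] = {}
    # Visit longer paths first so each shorter path is compared against
    # every kept path that could contain it. Ties are broken by path id.
    ordered = sorted(second_path_map, key=lambda p: (-len(second_path_map[p]), p))
    for pid in ordered:
        nodes = second_path_map[pid]
        if not any(is_included(nodes, kept) for kept in first_path_map.values()):
            first_path_map[pid] = nodes
    # Rebuild the node mapping so it covers detection paths only.
    first_node_map: Dict[str, Set[str]] = {}
    for pid, nodes in first_path_map.items():
        for node in nodes:
            first_node_map.setdefault(node, set()).add(pid)
    return first_path_map, first_node_map
```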

Further, packet loss detection is initiated on the at least one detection path based on a packet loss detection strategy, and at least one packet loss path with a packet loss fault in the at least one detection path is determined. In practical application, the network fault detection apparatus actively initiates packet loss detection on the at least one detection path when the detection condition is satisfied; for example, the detection condition may be that a preset timer expires.
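A timer-based detection condition could be sketched as follows. The class name `ProbeTimer`, its parameters, and the injectable clock are assumptions made for illustration; the application only states that detection starts when a timing condition is met.

```python
import time
from typing import Callable


class ProbeTimer:
    """Reports when the next round of active detection is due (hypothetical helper)."""

    def __init__(self, interval_s: float, clock: Callable[[], float] = time.monotonic):
        self.interval_s = interval_s
        self.clock = clock
        self.next_due = clock() + interval_s

    def due(self) -> bool:
        """Return True once the interval has elapsed, then schedule the next round."""
        now = self.clock()
        if now < self.next_due:
            return False
        self.next_due = now + self.interval_s
        return True
```

Passing the clock in as a parameter lets the interval logic be exercised without waiting in real time.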

Here, the packet loss detection policy is used to detect whether a packet loss fault exists on the target path. For example, some network nodes in the computer network act as detection senders, and other network nodes act as detection receivers. A sending end periodically sends a first probe to a receiving end on the target path; after receiving the first probe, the receiving end reconstructs it to obtain a second probe and returns the second probe to the source sending end. The sending end then determines whether a packet loss fault exists on the target path according to the number of first probes sent and the number of second probes received.
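The probe exchange can be illustrated with a small data structure. The field names (`seq`, `src`, `dst`, `is_reply`) and both helper functions are hypothetical; the application does not define the probe layout. The sketch only shows the idea of the receiving end rebuilding a first probe into a second probe addressed back to the sender:

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Probe:
    # All fields are illustrative assumptions, not the application's probe format.
    seq: int                 # sequence number assigned by the sending end
    src: str                 # node that emits this probe
    dst: str                 # node that should receive this probe
    is_reply: bool = False   # False for a first probe, True for a second probe


def build_first_probe(seq: int, sender: str, receiver: str) -> Probe:
    """Sending end: construct a first probe for path detection."""
    return Probe(seq=seq, src=sender, dst=receiver)


def rebuild_as_second_probe(first: Probe) -> Probe:
    """Receiving end: swap the endpoints and mark the probe as a reply."""
    return replace(first, src=first.dst, dst=first.src, is_reply=True)
```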

The existing packet loss detection mechanism cannot discover transient network problems: it is triggered passively, starts only after a network problem has occurred, and has no ability to discover faults actively. Precisely because the existing mechanism occupies a large amount of resources, it cannot probe the whole network frequently, so some transient network problems disappear before they can be discovered. Although such problems are transient, their impact on the stability of the computer network is not small. The method and the device of the present application can actively initiate network fault detection and are able to discover transient network problems, so that the reliability of the network is ensured.

Fig. 2 is a schematic flow chart of packet loss path detection in the embodiment of the present application, and as shown in fig. 2, the method for detecting a packet loss path in step 103 includes the following steps:

step 201: constructing a first probe for path detection;

step 202: if the detection condition is met, sending the first probe to a target detection path in the at least one detection path;

here, the detection condition may be that the timer for sending the first probe to the target path expires. The timing interval may be set flexibly according to the real-time requirement: the interval is short when the real-time requirement is high and long when the real-time requirement is low.

The target path comprises a sending end and a receiving end, and other network nodes may lie between them. The sending end constructs a first probe for path detection and sends it to the receiving end through the target path; after receiving the first probe, the receiving end reconstructs it to obtain a second probe and returns the second probe to the source sending end. The sending end thereby determines whether a packet loss fault exists on the target path according to the number of first probes sent and the number of second probes received.

Step 203: receiving a second probe returned by the target detection path in response to the first probe;

in practical application, a sending quantity parameter send_pkts is set at the sending end to record the number of first probes sent by the sending end, and a receiving quantity parameter recv_pkts is set at the receiving end to record the number of second probes sent by the receiving end. Further, whether a packet loss fault exists on the target detection path is judged according to send_pkts and recv_pkts: if send_pkts is not equal to recv_pkts, the target path is determined to be a packet loss path; and if send_pkts is equal to recv_pkts, the target path is determined to be a normal path.

Step 204: if the sending number of the first probes is not equal to the receiving number of the second probes, determining that the target path is a packet loss path;

step 205: and if the sending number of the first probes is equal to the receiving number of the second probes, determining that the target path is a normal path.

It should be noted that the above is only an exemplary judgment condition for a packet loss path; in practical application, the judgment condition may be adjusted appropriately according to the packet loss tolerance. For example, when the difference between the sending number and the receiving number of the probes is within the range of the packet loss tolerance, the path is determined to be a normal path; and when the difference exceeds the range of the packet loss tolerance, the path is determined to be a packet loss path.
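The judgment in steps 204 and 205, together with the tolerance variant, can be sketched as a single comparison. The function name `classify_path`, the `tolerance` parameter, and the returned labels are assumptions; `send_pkts` and `recv_pkts` follow the counters named above.

```python
def classify_path(send_pkts: int, recv_pkts: int, tolerance: int = 0) -> str:
    """Classify a target detection path from its probe counters.

    `tolerance` (hypothetical parameter) is the largest difference between the
    two counters that is still treated as normal. The default of 0 reproduces
    the strict send_pkts == recv_pkts comparison.
    """
    difference = abs(send_pkts - recv_pkts)
    return "packet_loss" if difference > tolerance else "normal"
```

For instance, with ten probes sent and nine replies received, the path is classified as a packet loss path under the strict rule and as a normal path when one missing reply is tolerated.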

Further, according to the determined at least one packet loss path and the at least one detection path, a first type of fault node with a complete packet loss fault and a second type of fault node with a partial packet loss fault on the at least one packet loss path are determined.

The first type of fault node with complete packet loss fault refers to that all paths passing through the node are packet loss paths, and the second type of fault node with partial packet loss fault refers to that one part of the paths passing through the node is a packet loss path and the other part of the paths passing through the node is a normal path.

For a first type of fault node, the probability that the path packet loss is caused by a fault in the node itself is high, so passive detection can be performed on the first type of fault node based on this result to confirm the root cause of the packet loss problem. For a second type of fault node, the probability that the path packet loss is caused by a fault in the node itself is low, so passive detection is either not needed or is performed only on those second type fault nodes that meet a condition, for example, those for which the ratio of packet loss paths to total paths is greater than a preset ratio.
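The selection rule for passive detection could look like the sketch below. The function name, its parameters, and the `min_ratio` threshold are hypothetical; the application only says that second type nodes may be checked when their ratio of packet loss paths to total paths exceeds a preset ratio.

```python
from typing import Dict, List, Set


def nodes_for_passive_detection(
    first_type: Set[str],
    second_type: Set[str],
    node_to_paths: Dict[str, Set[str]],
    loss_paths: Set[str],
    min_ratio: float,
) -> List[str]:
    """Pick all first type nodes plus second type nodes whose loss ratio exceeds min_ratio."""
    selected = set(first_type)
    for node in second_type:
        paths = node_to_paths.get(node, set())
        # Skip nodes with no recorded paths to avoid dividing by zero.
        if paths and len(paths & loss_paths) / len(paths) > min_ratio:
            selected.add(node)
    return sorted(selected)
```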

By adopting the technical scheme, after the packet loss path is determined, the fault point in the packet loss path is further determined for accurate positioning, and the subsequent network maintenance efficiency is improved.

On the basis of the foregoing embodiment, a more detailed network fault detection method is further provided, fig. 3 is a schematic diagram of a second process of the network fault detection method in the embodiment of the present application, and as shown in fig. 3, the method includes:

step 301: acquiring a second path mapping relation and a second node mapping relation of a plurality of paths in the network topology structure;

in practical application, a topology structure of a computer network is established, and the network topology structure includes a mapping relationship between a path and a network node, and specifically may include a second path mapping relationship from the path to the network node and a second node mapping relationship from the network node to the path.

Because many paths in the network topology structure have inclusion relations, the data packets do not need to be repeatedly transmitted and received on the included paths, so that the bandwidth pressure of the network is reduced, and the bandwidth utilization rate is improved.

Step 302: based on a minimized path algorithm, simplifying a second path mapping relation and a second node mapping relation of a plurality of paths in the network topology structure, and determining the at least one detection path, and a first path mapping relation and a first node mapping relation of the at least one detection path;

according to the mapping relation between the paths and the network nodes in the network topology structure, repeated paths in the network topology structure are filtered, the bandwidth pressure of the network can be reduced, the occupation of system resources is reduced, and the real-time performance of detection is improved.

Here, the minimized path algorithm may be a minimum heap algorithm.

Step 303: initiating packet loss detection on the at least one detection path based on a packet loss detection strategy, and determining at least one packet loss path with a packet loss fault in the at least one detection path;

step 304: and determining a first type of fault node and a second type of fault node based on the at least one packet loss path, the first path mapping relation of the at least one detection path and the first node mapping relation.

Here, the first type of failed node having a complete packet loss fault means that all paths passing through the node are packet loss paths, and the second type of failed node having a partial packet loss fault means that one part of the paths passing through the node is a packet loss path and the other part is a normal path.

In some embodiments, step 304 specifically includes: determining network nodes corresponding to the at least one packet loss path based on the at least one packet loss path and the first path mapping relationship of the at least one detection path, and forming a packet loss suspected set from the network nodes corresponding to the at least one packet loss path; determining the first type of fault node with a complete packet loss fault based on the packet loss suspected set and the first node mapping relationship of the at least one detection path; and obtaining the second type of fault node after the first type of fault node is eliminated from the packet loss suspected set.

That is, the network nodes corresponding to the packet loss paths are determined according to the packet loss paths and the mapping relation from paths to network nodes, which yields the packet loss suspected set; the first type of fault node is determined according to the packet loss paths and the mapping relation from network nodes to paths; and the first type of fault node is removed from the packet loss suspected set to obtain the second type of fault node.

Specifically, determining the first type of faulty node with a complete packet loss fault based on the packet loss suspected set and the first node mapping relationship of the at least one detection path includes: determining a path corresponding to each suspected node in the packet loss suspected set based on the packet loss suspected set and the first node mapping relation of the at least one detection path; and determining suspected nodes of which the paths are all packet loss paths to be first-class fault nodes based on the at least one packet loss path and the paths corresponding to the suspected nodes.
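The localization in step 304 reduces to a few set operations. The sketch below assumes the two first mapping relations are held as dictionaries keyed by path id and node name; the function name `locate_fault_nodes` and the sample identifiers used when calling it are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def locate_fault_nodes(
    loss_paths: Set[str],
    first_path_map: Dict[str, List[str]],
    first_node_map: Dict[str, Set[str]],
) -> Tuple[Set[str], Set[str]]:
    """Return (first type nodes, second type nodes) for the given packet loss paths."""
    # Suspected set: every node that lies on at least one packet loss path.
    suspected = {node for pid in loss_paths for node in first_path_map[pid]}
    # First type: every detection path through the node is a packet loss path.
    first_type = {node for node in suspected if first_node_map[node] <= loss_paths}
    # Second type: what remains after removing the first type from the suspected set.
    second_type = suspected - first_type
    return first_type, second_type
```

With detection paths p1 = h1-s1-h2, p2 = h1-s1-s2-h3 and p3 = h1-s3-h4, and p2 as the only packet loss path, nodes s2 and h3 are classified as first type, while h1 and s1 are classified as second type because they also lie on normal paths.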

In practical application, the suspect node set may also include normal nodes. The normal node means that all paths passing through the node are normal paths.

Correspondingly, the method further comprises the following steps: determining a normal node without a packet loss fault based on the packet loss suspected set and the first node mapping relation of the at least one detection path; and after the first type of fault node is eliminated from the packet loss suspected set, a normal node is eliminated to obtain a second type of fault node.

The specific method for determining the normal node comprises the following steps: determining a path corresponding to each suspected node in the packet loss suspected set based on the packet loss suspected set and the first node mapping relation of the at least one detection path; and determining the suspected node without the packet loss path as a normal node based on the at least one packet loss path and the path corresponding to the suspected node.

By adopting the technical scheme, repeated detection paths are reduced based on the characteristics of the network topology structure so as to reduce the bandwidth pressure of the network, network faults can be found in time through active detection, the reliability of the network is ensured, and fault points of network problems can be accurately positioned, so that the network maintenance efficiency is improved.

Fig. 4 is a schematic diagram of a third flow of the network fault detection method in the embodiment of the present application. As shown in fig. 4, the fault detection method is implemented by three threads in the process uping, namely pinger, receiver and upong. The process and its three threads exist at every agent end of the active detection and form its basic framework. The division of labor among the four objects is as follows:

uping: responsible for creating the pinger thread and the receiver thread, sharing a virtual address space with the threads, generating a configuration file, and periodically receiving feedback from the receiver to obtain its returned results.

pinger: the sending thread is responsible for constructing the content of the probe for active detection, determining a sending end port and a receiving end port according to the configuration file generated by uping, and sending a first probe for active detection to the receiving end at regular time;

receiver: the receiving thread, responsible for receiving the second probe reconstructed by upong, parsing the content of the probe, and reporting the parsed content to the uping process periodically or once a set quantity is reached.

upong: every agent has an upong, which acts when a first probe from a sending end is received. In special cases, it adds key collected information to the reserved bits of the probe to indicate other network faults, such as whether network congestion has occurred. After reconstructing the probe content, it replies to the source host so that the receiver thread of the sending-end host receives the probe content, thereby implementing a data transfer function.
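The reconstruction step performed by upong can be illustrated with a small sketch. The source does not disclose the probe layout, so the 13-byte format below (sequence number, send timestamp, one reserved flag byte), the flag values, and all function names are hypothetical and chosen only for illustration.

```python
import struct

# Hypothetical probe layout (not from the source): network byte order,
# uint32 sequence number, float64 send timestamp, one reserved flag byte.
PROBE_FORMAT = "!IdB"
FLAG_REPLY = 0x01       # assumed bit: probe was rebuilt by upong
FLAG_CONGESTED = 0x02   # assumed bit: responder observed congestion


def build_probe(seq: int, sent_at: float) -> bytes:
    """pinger side: build the first probe with all reserved bits cleared."""
    return struct.pack(PROBE_FORMAT, seq, sent_at, 0)


def rebuild_probe(first_probe: bytes, congested: bool = False) -> bytes:
    """upong side: parse the first probe and build the second (reply) probe."""
    seq, sent_at, flags = struct.unpack(PROBE_FORMAT, first_probe)
    flags |= FLAG_REPLY
    if congested:
        flags |= FLAG_CONGESTED
    return struct.pack(PROBE_FORMAT, seq, sent_at, flags)


def parse_probe(probe: bytes) -> dict:
    """receiver side: decode a probe into its fields."""
    seq, sent_at, flags = struct.unpack(PROBE_FORMAT, probe)
    return {
        "seq": seq,
        "sent_at": sent_at,
        "is_reply": bool(flags & FLAG_REPLY),
        "congested": bool(flags & FLAG_CONGESTED),
    }
```

Keeping the payload to a few bytes is in line with the lightweight probe data section described in the embodiment; a real probe would likely also carry the path identification needed by the receiver.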

Ports 1, 2 and 3 in fig. 4 are the ports used for data interaction between hosts during active detection; they indicate from where a probe of the active detection tool is sent, who receives it, and who receives the response.

The process uping further includes a link_failure_localization thread for locating the fault point.

The network failure detection method is further explained below.

Fig. 5 is a fourth flowchart of the network fault detection method in this embodiment. The agent end 1 and the agent end 2 in fig. 5 are located at different network nodes (for example, the network nodes are hosts) and have the same content. Data communication may be performed between the main process (run_main) and the sub-process (active_detector), data may be shared between active_detector and its four threads (pinger, receiver, upong and link_failure_localization), and a dashed box represents the sending and receiving of active detection packets, content acquisition and updating between different hosts.

First, agent1 starts the run_main process, which is the entry point for troubleshooting all network fault problems. On the basis of run_main, the sub-process active_detector of the active detection function is then started. It has three threads for sending and receiving active detection packets, namely pinger, receiver and upong, whose functions are explained above, and one further thread, link_failure_localization, which executes the link packet loss fault localization algorithm on the active detection results. The active_detector sub-process shares a variable cfg.g_loss with the pinger and receiver threads. This variable is a Python dictionary whose key is a pathID (the IP combination of agent1 and agent2) and whose value is a Python tuple (send_pkts, recv_pkts), where send_pkts is the number of packets sent on the path identified by pathID (e.g., agent1->agent2) and recv_pkts is the number of packets returned by agent2 and received by agent1.

After the three threads are started, the pinger in agent1 reads the IP of the host where agent1 is located in the data center network and sends a constructed active detection packet to the target host (the host where agent2 is located) according to a certain rule. When sending the packet, the pinger increments the sent count in cfg.g_loss[pathID] by one and leaves the received count unchanged. The upong of agent2 then receives the active detection packet recv_probe sent by the pinger of agent1, parses it according to a preset format, extracts its content, and learns the IP and port of its source (agent1). The upong thread reconstructs recv_probe to obtain reply_probe and returns reply_probe to the receiver thread in agent1. On receiving the packet, the receiver thread likewise parses reply_probe according to the preset format and learns the IP and port of its source (agent2), and the received count in cfg.g_loss[pathID] is incremented by one. This completes the active detection process from agent1 to agent2. The active detection process from agent2 to agent1 is the reverse of the above.
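The shared counter dictionary and its two updates can be sketched as below. The class `LossCounters`, its method names and the use of a lock are assumptions for illustration; the source only states that the dictionary maps a pathID to a (sent, received) tuple that is shared between the sub-process and its threads.

```python
import threading
from typing import Dict, Tuple


class LossCounters:
    """Hypothetical thread-safe stand-in for the shared pathID dictionary."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        # pathID -> (packets sent, packets received)
        self._counts: Dict[str, Tuple[int, int]] = {}

    def record_sent(self, path_id: str) -> None:
        """Called by the sending thread: sent + 1, received unchanged."""
        with self._lock:
            sent, received = self._counts.get(path_id, (0, 0))
            self._counts[path_id] = (sent + 1, received)

    def record_received(self, path_id: str) -> None:
        """Called by the receiving thread: received + 1, sent unchanged."""
        with self._lock:
            sent, received = self._counts.get(path_id, (0, 0))
            self._counts[path_id] = (sent, received + 1)

    def snapshot(self) -> Dict[str, Tuple[int, int]]:
        """Return a copy so that readers never see a half-updated view."""
        with self._lock:
            return dict(self._counts)
```

Because the tuple is replaced as a whole under the lock, a reader that takes a snapshot always sees a consistent (sent, received) pair for each path.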

Because pathID is the unique criterion for distinguishing links, it is clear which link has lost packets (recv_pkts is not equal to send_pkts). In addition, the active_detector process may agree on an interval, for example 30 s, at which to obtain the active detection result, that is, the content of the cfg.g_loss variable, and pass it to the link_failure_localization thread for the subsequent link packet loss fault localization logic. link_failure_localization executes the algorithm logic on the cfg.g_loss it receives and finally locates the specific position of the packet loss in the link, such as a switch, a router or the host side. The result is passed to the run_main parent process through the active_context content for further passive detection to confirm the root cause of the packet loss problem.
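The comparison of sent and received counts can be expressed in a few lines. This is a sketch under the assumption that a snapshot of the counters is available as a plain dictionary; the function names are hypothetical.

```python
from typing import Dict, Set, Tuple

# pathID -> (packets sent, packets received); alias introduced for the example
Counters = Dict[str, Tuple[int, int]]


def build_loss_matrix(counters: Counters) -> Counters:
    """Keep only the paths whose received count differs from the sent count."""
    return {
        path_id: (sent, received)
        for path_id, (sent, received) in counters.items()
        if received != sent
    }


def loss_paths(counters: Counters) -> Set[str]:
    """Return the set of pathIDs on which packet loss was observed."""
    return set(build_loss_matrix(counters))
```

In a periodic collection loop, the snapshot taken at each interval (30 s in the example of the text) would be passed to these helpers and the result handed to the localization step.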

Fig. 6 is a schematic diagram of a first process of fault point detection in an embodiment of the present application. In fig. 6, a dashed box represents the implementation logic of the data center network fault localization algorithm, a solid box represents an implemented function, the direction of an arrow represents the passing of function parameters, and the labels beside an arrow are the parameters passed. The path matrix (path_matrix) represents the mapping from pathIDs to linkIDs. PathIDs are generated by network topology rules: assuming that there are 10 agents in the data center network, n paths are generated in the topology according to the network topology rules set by the user, and each pathID corresponds to one of the n paths and serves as the unique identifier of that network path. LinkIDs are the unique identifiers of network nodes in the network topology, such as agent hosts, switches, routers and other devices. One path (pathID) therefore passes through multiple devices (linkIDs). Conversely, multiple paths (pathIDs) pass through one node (linkID); this mapping is the link matrix (link_matrix). The loss matrix (loss_matrix) is obtained by summarizing the difference between send_pkts and recv_pkts in the cfg.g_loss result of the active detection process.
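The relationship between the two matrices can be made concrete: the link matrix is the inverse mapping of the path matrix. The sketch below assumes both are held as Python dictionaries; the function name `build_link_matrix` and the sample identifiers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_link_matrix(path_matrix: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Invert pathID -> [linkID, ...] into linkID -> {pathID, ...}."""
    link_matrix: Dict[str, Set[str]] = defaultdict(set)
    for path_id, link_ids in path_matrix.items():
        for link_id in link_ids:
            link_matrix[link_id].add(path_id)
    return dict(link_matrix)
```

For example, if path p1 traverses h1, s1, h2 and path p2 traverses h1, s1, s2, h3, the resulting entry for s1 is {p1, p2}, while the entry for h3 is {p2}.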

First, the network topology structure of the data center network is read by the get network topology rule function (get_topology_rule), so that the state of the network topology is known. Because many paths in the network topology have an inclusion relationship, there is no need to repeatedly send and receive packets on the included paths; this reduces the bandwidth pressure on the data center network and the bandwidth occupied on the links. This path minimization function is implemented with a minimum heap algorithm (minimum_heap_algorithm), from which a simplified path matrix (select_path_matrix) and a simplified link matrix (select_link_matrix) are derived. Next, the operation of obtaining the suspected node set (get_span_links) is performed in combination with the loss matrix (loss_matrix). The set of pathIDs in loss_matrix indicates which paths have experienced packet loss events; from this pathID set and the simplified path matrix (select_path_matrix), the corresponding linkIDs suspected of packet loss are obtained, and the set formed by these linkIDs is the packet loss suspected set (loss_suspects_links). Then the operation of obtaining the complete packet loss suspected set (get_full_suspects_links) is performed: from the pathIDs in the simplified link matrix (select_link_matrix) and the loss matrix (loss_matrix), it is known whether all the paths (pathIDs) passing through a link (linkID) are packet loss paths, and the packet loss suspected set (loss_suspects_links) is accordingly divided into the complete packet loss suspected set and the partial packet loss suspected set. At this point it is clear which devices (linkIDs) in the current data center network may have problems that cause complete or partial packet loss on the paths (pathIDs). The device problems can then be further examined with a passive detection tool, and a suggested solution is given.
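The idea of dropping included paths can be illustrated as follows. The source does not spell out its minimum heap algorithm, so this sketch substitutes a simple subset check and should not be read as that algorithm; it also assumes that "inclusion" means the node set of one path is contained in the node set of another. The function name `reduce_paths` is hypothetical.

```python
from typing import Dict, Set


def reduce_paths(path_matrix: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Drop every path whose node set is contained in a kept path's node set."""
    kept: Dict[str, Set[str]] = {}
    # Visit longer paths first so that a contained path is always seen after
    # the path containing it; ties are broken by pathID for determinism.
    ordered = sorted(path_matrix, key=lambda p: (-len(path_matrix[p]), p))
    for path_id in ordered:
        nodes = path_matrix[path_id]
        if not any(nodes <= other for other in kept.values()):
            kept[path_id] = set(nodes)
    return kept
```

Probing only the kept paths still exercises every node that appears in the original path matrix, since each dropped path is covered by a kept one. The simplified link matrix can then be derived by inverting the reduced path matrix.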

Fig. 7 is a second flowchart of fault point detection in the embodiment of the present application, and fig. 7 exemplarily shows a path matrix and a link matrix, a simplified path matrix and link matrix, a packet loss suspected set, a complete packet loss suspected set, and a partial packet loss suspected set.

Wherein the path matrix is

The path matrix after the reduction of the minimum heap algorithm is

Here, a2 is less than or equal to a1, b2 is less than or equal to b1, c2 is less than or equal to c1, index is a path index, and value is a node set corresponding to the path.

The link matrix is

The link matrix after the reduction of the minimum heap algorithm is

Here, d2 is less than or equal to d1, e2 is less than or equal to e1, f2 is less than or equal to f1, index is a node index, and value is a path set corresponding to the node.

Packet loss matrix

Here, the key value is a path index, and the value is the transmission number and the reception number of the path.

Packet loss suspected set: loss_subset_links = {linkID1, ..., linkIDg};

Complete packet loss suspected set: full_loss_subset_links = {linkID1, ..., linkIDg1};

Partial packet loss suspected set: partial_loss_subset_links = {linkID1, ..., linkIDg2};

Here, g1 + g2 is less than or equal to g.
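The inequality can be justified from the set relations described earlier. In the sketch below, S, F, P and N are shorthand introduced here (not used in the original) for the packet loss suspected set, the complete packet loss suspected set, the partial packet loss suspected set and the set of normal nodes in S, respectively.

```latex
F \cap P = \varnothing, \qquad F \cup P = S \setminus N
\;\Longrightarrow\;
g_1 + g_2 = |F| + |P| = |F \cup P| = |S| - |N| \le |S| = g
```

Equality holds exactly when the suspected set contains no normal node.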

The network fault detection method provided in the embodiment of the application mainly remedies four defects of existing link detection tools. First, the lightweight data section of the probe effectively reduces the consumption of system resources (bandwidth, CPU resources and the like). Second, repeated detection paths are reduced based on the characteristics of the network topology, which significantly reduces the bandwidth pressure on the data center network. Third, because it is lightweight, the tool can run in the background as a resident process and monitor the data center network globally; it is proactive, relatively independent of the service system, and does not need the outbreak of a specific network problem to trigger its start, so the real-time nature of active detection is exploited and many transient network problems can be found. Fourth, the active detection results can be analyzed with the fault localization algorithm to accurately locate the fault point of a network problem.

Example two

An embodiment of the present application further provides a network fault detection apparatus, and as shown in fig. 8, the apparatus includes:

an obtaining unit 801, configured to obtain a network topology of a computer network; the network topology structure comprises a plurality of paths composed of different network nodes;

a processing unit 802 for determining at least one detection path from a plurality of paths of the network topology;

a detecting unit 803, configured to initiate packet loss detection on the at least one detection path based on a packet loss detection policy, and determine at least one packet loss path in the at least one detection path where a packet loss fault exists;

a fault locating unit 804, configured to perform fault point location based on the at least one detection path and the at least one packet loss path, and determine a first type of fault node where a complete packet loss fault exists and a second type of fault node where a partial packet loss fault exists on the at least one packet loss path.

In some embodiments, the fault location unit 804 is specifically configured to obtain a first path mapping relationship and a first node mapping relationship of the at least one detection path; the path mapping relation comprises a mapping relation from a path to a network node, and the node mapping relation comprises a mapping relation from a network node to a path.

In some embodiments, the fault locating unit 804 is specifically configured to determine, based on the at least one packet loss path and the first path mapping relationship of the at least one detection path, the network nodes corresponding to the at least one packet loss path, and form a packet loss suspected set from the network nodes corresponding to the at least one packet loss path; determine the first type of fault node with a complete packet loss fault based on the packet loss suspected set and the first node mapping relation of the at least one detection path; and obtain the second type of fault node after the first type of fault node is eliminated from the packet loss suspected set.

In some embodiments, the fault locating unit 804 is specifically configured to determine, based on the packet loss suspected set and the first node mapping relationship of the at least one detection path, a path corresponding to each suspected node in the packet loss suspected set; and determining suspected nodes of which the paths are all packet loss paths to be first-class fault nodes based on the at least one packet loss path and the paths corresponding to the suspected nodes.

In some embodiments, the obtaining unit 801 is specifically configured to obtain a second path mapping relationship and a second node mapping relationship of multiple paths in the network topology;

correspondingly, the processing unit 802 is specifically configured to, based on a minimization path algorithm, reduce a second path mapping relationship and a second node mapping relationship of multiple paths in the network topology, and determine the at least one detection path, and a first path mapping relationship and a first node mapping relationship of the at least one detection path.

In some embodiments, the detection unit 803 is specifically configured to configure a first probe for path detection; if the detection condition is met, send the first probe to a target detection path in the at least one detection path; receive a second probe returned by the target detection path in response to the first probe; if the sending number of the first probes is not equal to the receiving number of the second probes, determine that the target path is a packet loss path; and if the sending number of the first probes is equal to the receiving number of the second probes, determine that the target path is a normal path.

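An embodiment of the present application further provides a network device, as shown in fig. 9, where the network device includes: a processor 901 and a memory 902 configured to store a computer program capable of running on the processor; the processor 901 implements the following steps when running the computer program in the memory 902: acquiring a network topology structure of a computer network, the network topology structure comprising a plurality of paths composed of different network nodes; determining at least one detection path from the plurality of paths of the network topology structure; initiating packet loss detection on the at least one detection path based on a packet loss detection strategy, and determining at least one packet loss path with a packet loss fault in the at least one detection path; and performing fault point location based on the at least one detection path and the at least one packet loss path, and determining a first type of fault node with a complete packet loss fault and a second type of fault node with a partial packet loss fault on the at least one packet loss path.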

In some embodiments, the processor 901, when running the computer program in the memory 902, implements the following steps: acquiring a first path mapping relation and a first node mapping relation of the at least one detection path; the path mapping relation comprises a mapping relation from a path to a network node, and the node mapping relation comprises a mapping relation from a network node to a path; and determining the first type of fault node and the second type of fault node based on the at least one packet loss path, the first path mapping relation of the at least one detection path and the first node mapping relation.

In some embodiments, the processor 901, when running the computer program in the memory 902, implements the following steps: determining the network nodes corresponding to the at least one packet loss path based on the at least one packet loss path and the first path mapping relationship of the at least one detection path, and forming a packet loss suspected set from the network nodes corresponding to the at least one packet loss path; determining the first type of fault node with a complete packet loss fault based on the packet loss suspected set and the first node mapping relation of the at least one detection path; and obtaining the second type of fault node after the first type of fault node is eliminated from the packet loss suspected set.

In some embodiments, the processor 901, when running the computer program in the memory 902, implements the following steps: determining a path corresponding to each suspected node in the packet loss suspected set based on the packet loss suspected set and the first node mapping relation of the at least one detection path; and determining suspected nodes of which the paths are all packet loss paths to be first-class fault nodes based on the at least one packet loss path and the paths corresponding to the suspected nodes.

In some embodiments, the processor 901, when running the computer program in the memory 902, implements the following steps: acquiring a second path mapping relation and a second node mapping relation of a plurality of paths in the network topology structure; based on a minimized path algorithm, simplifying a second path mapping relation and a second node mapping relation of a plurality of paths in the network topology structure, and determining the at least one detection path, and a first path mapping relation and a first node mapping relation of the at least one detection path.

In some embodiments, the processor 901, when running the computer program in the memory 902, implements the following steps: configuring a first probe for path detection; if the detection condition is met, sending the first probe to a target detection path in the at least one detection path; receiving a second probe returned by the target detection path in response to the first probe; if the sending number of the first probes is not equal to the receiving number of the second probes, determining that the target path is a packet loss path; and if the sending number of the first probes is equal to the receiving number of the second probes, determining that the target path is a normal path.

Of course, in actual practice, the various components in the network device are coupled together by a bus system 903, as shown in FIG. 9. It is understood that the bus system 903 is used to enable communications among the components. The bus system 903 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as the bus system 903 in FIG. 9.

It should be noted that the network device in the embodiments of the present application is equivalent to a network node.

The embodiments of the present application further provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the method according to any of the embodiments.

In practical applications, the processor may be at least one of an Application Specific Integrated Circuit (ASIC), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a controller, a microcontroller, and a microprocessor. It is understood that the electronic devices for implementing the above processor functions may be other devices, and the embodiments of the present application are not limited in particular.

The Memory may be a volatile Memory (volatile Memory), such as a Random-Access Memory (RAM); or a non-volatile Memory (non-volatile Memory), such as a Read-Only Memory (ROM), a flash Memory (flash Memory), a Hard Disk (HDD), or a Solid-State Drive (SSD); or a combination of the above types of memories and provides instructions and data to the processor.

It should be noted that: "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.

The methods disclosed in the several method embodiments provided in the present application may be combined arbitrarily without conflict to obtain new method embodiments.

Features disclosed in several of the product embodiments provided in the present application may be combined in any combination to yield new product embodiments without conflict.

The features disclosed in the several method or apparatus embodiments provided in the present application may be combined arbitrarily, without conflict, to arrive at new method embodiments or apparatus embodiments.

The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims

1. A method for network fault detection, the method comprising:

2. The method of claim 1, wherein the performing fault point location based on the at least one detection path and the at least one packet loss path comprises:

acquiring a first path mapping relation and a first node mapping relation of the at least one detection path; the path mapping relation comprises a mapping relation from a path to a network node, and the node mapping relation comprises a mapping relation from a network node to a path;

and determining the first type of fault node and the second type of fault node based on the at least one packet loss path, the first path mapping relation of the at least one detection path and the first node mapping relation.

3. The method according to claim 2, wherein the determining the first type of failed node and the second type of failed node based on the at least one packet loss path, the first path mapping relationship of the at least one detection path, and the first node mapping relationship comprises:

determining network nodes corresponding to the at least one packet loss path based on the first path mapping relationship between the at least one packet loss path and the at least one detection path, and forming a packet loss suspected set by the network nodes corresponding to the at least one packet loss path;

determining the first type of fault node with a complete packet loss fault based on the packet loss suspected set and the first node mapping relation of the at least one detection path;

and obtaining the second type of fault node after the first type of fault node is eliminated from the packet loss suspected set.

4. The method according to claim 3, wherein the determining that the first type of failed node with a complete packet loss fault exists based on the suspected set of packet loss and the first node mapping relationship of the at least one detection path includes:

determining a path corresponding to each suspected node in the packet loss suspected set based on the packet loss suspected set and the first node mapping relation of the at least one detection path;

and determining suspected nodes of which the paths are all packet loss paths to be first-class fault nodes based on the at least one packet loss path and the paths corresponding to the suspected nodes.

5. The method of claim 2, wherein obtaining the network topology of the computer network comprises:

acquiring a second path mapping relation and a second node mapping relation of a plurality of paths in the network topology structure;

correspondingly, the determining at least one detection path from the plurality of paths of the network topology includes:

based on a minimized path algorithm, simplifying a second path mapping relation and a second node mapping relation of a plurality of paths in the network topology structure, and determining the at least one detection path, and a first path mapping relation and a first node mapping relation of the at least one detection path.

6. The method of claim 1, wherein the initiating packet loss detection for the at least one detection path based on a packet loss detection policy comprises:

configuring a first probe for path detection;

if the detection condition is met, sending the first probe to a target detection path in the at least one detection path;

receiving a second probe returned by the target detection path in response to the first probe;

if the sending number of the first probes is not equal to the receiving number of the second probes, determining that the target path is a packet loss path;

and if the sending number of the first probes is equal to the receiving number of the second probes, determining that the target path is a normal path.

7. An apparatus for network fault detection, the apparatus comprising:

8. The apparatus according to claim 7, wherein the fault location unit is specifically configured to obtain a first path mapping relationship and a first node mapping relationship of the at least one detection path; the path mapping relationship comprises a mapping relationship from a path to a network node, the node mapping relationship comprises a mapping relationship from a network node to a path, and the first type fault node and the second type fault node are determined based on the at least one packet loss path, the first path mapping relationship and the first node mapping relationship of the at least one detection path.

9. The apparatus according to claim 8, wherein the fault location unit is specifically configured to determine, based on the at least one packet loss path and the first path mapping relationship of the at least one detection path, the network nodes corresponding to the at least one packet loss path, and form a packet loss suspected set from the network nodes corresponding to the at least one packet loss path; determine the first type of fault node with a complete packet loss fault based on the packet loss suspected set and the first node mapping relation of the at least one detection path; and obtain the second type of fault node after the first type of fault node is eliminated from the packet loss suspected set.

10. The apparatus according to claim 9, wherein the fault location unit is specifically configured to determine, based on the packet loss suspected set and a first node mapping relationship of the at least one detection path, a path corresponding to each suspected node in the packet loss suspected set; and determining suspected nodes of which the paths are all packet loss paths to be first-class fault nodes based on the at least one packet loss path and the paths corresponding to the suspected nodes.

11. The apparatus according to claim 8, wherein the obtaining unit is specifically configured to obtain a second path mapping relationship and a second node mapping relationship of a plurality of paths in the network topology;

12. The apparatus according to claim 7, wherein the detection unit is specifically configured to configure a first probe for path detection; if the detection condition is met, send the first probe to a target detection path in the at least one detection path; receive a second probe returned by the target detection path in response to the first probe; if the sending number of the first probes is not equal to the receiving number of the second probes, determine that the target path is a packet loss path; and if the sending number of the first probes is equal to the receiving number of the second probes, determine that the target path is a normal path.

13. A network device, the network device comprising: a processor and a memory configured to store a computer program capable of running on the processor,

wherein the processor is configured to perform the steps of the method of any one of claims 1 to 6 when running the computer program.

14. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.