CN110493042B - Fault diagnosis method and device and server - Google Patents

Info

Publication number
CN110493042B
Authority
CN
China
Prior art keywords
node
fault
alarm
topology
alarm information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910757406.7A
Other languages
Chinese (zh)
Other versions
CN110493042A (en)
Inventor
李思维
邱依强
孙亚东
黄存峰
范纪明
韦巍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China United Network Communications Group Co Ltd
Original Assignee
China United Network Communications Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China United Network Communications Group Co Ltd
Priority to CN201910757406.7A
Publication of CN110493042A
Application granted
Publication of CN110493042B
Current legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H04L41/0677 Localisation of faults
    • H04L41/12 Discovery or management of network topologies

Abstract

The invention provides a fault diagnosis method, a fault diagnosis device and a server. The fault diagnosis method comprises: establishing the topology of a network system; acquiring alarm information in the network system; associating the alarm information with the nodes in the topology to determine a fault link; and determining a fault root node in the fault link according to a preset fault probability calculation rule. By associating the alarm information with the topology of the network system to determine the fault link, the method quickly obtains the scope and degree of the service impact caused by the alarm. By determining the fault root node in the fault link according to the preset rule, it locates the fault automatically and accurately and improves the efficiency of fault handling.

Description

Fault diagnosis method and device and server
Technical Field
The present invention relates to network technologies, and in particular, to a fault diagnosis method, apparatus, and server.
Background
In various network systems, a large number of different types of network equipment exist, alarm information can be generated when the network equipment fails, and maintenance personnel can check the alarm information and timely find and process the faults existing in the system.
In existing alarm processing systems, each alarm generated on a device is usually sent out as an alarm message for that device; if a device raises alarms frequently for the same cause, the repeated alarms are merged and sent out as a single alarm message for that device.
In a network system with a complex structure and a large number of devices, the alarm information sent by the individual devices is not associated with each other, so the operating condition of the other devices associated with a device cannot be determined from that device's alarm information alone. When locating a fault, maintenance personnel therefore often have to check a large amount of alarm information from many devices to determine the root cause, which is inefficient.
Disclosure of Invention
The invention provides a fault diagnosis method, a fault diagnosis device and a server, which are used for quickly determining a fault root node in a network system and improving the fault positioning efficiency.
The invention provides a fault diagnosis method, which comprises the following steps:
establishing the topology of a network system;
acquiring alarm information in a network system;
associating the alarm information with the nodes in the topology to determine a fault link;
and determining a fault root node in the fault link according to a preset fault probability calculation rule.
Optionally, the associating the alarm information with the node in the topology to determine the failed link includes:
the alarm information is associated with the nodes in the topology, and the alarm nodes in the topology are determined;
and determining the link formed by the alarm node and the parent node and/or the child node of the alarm node as a fault link.
Optionally, the determining a failure root node in the failed link according to a preset failure probability calculation rule includes:
calculating the fault probability of each node according to the alarm information of each node in the fault link, the fault probability of the child node of each node and the number of the child nodes of each node;
respectively judging whether the fault probability of each node is greater than or equal to the fault threshold corresponding to each node;
if the fault probability of the first node is larger than or equal to the fault threshold corresponding to the first node, adding the first node into a candidate list;
and determining the node with the highest level in the candidate list as the fault root node.
Optionally, the calculating the failure probability of each node according to the alarm information of each node in the failed link, the failure probability of the child node of each node, and the number of child nodes includes:
calculating the failure probability of each node in the failure link according to the following formula:
[Formula shown as an image in the original publication (Figure BDA0002169205470000021); not reproduced in this text.]
wherein P is the fault probability of a node; A is 0 or 1, where A is 1 if the node has alarm information and A is 0 if the node has no alarm information; sum is the sum of the fault probabilities of the child nodes of the node; and count is the number of child nodes of the node.
Optionally, the establishing a topology of the network system includes:
acquiring network element information of each device in a network system within a preset time period, wherein the network element information of each device comprises a hierarchy of each device in the network system;
performing upper and lower level association on the network element information of different types of equipment;
and generating a visual topology according to the associated network element information.
Optionally, the performing upper and lower level association on the network element information of the different types of devices includes:
and determining the next-level equipment of each equipment layer by layer from the equipment at the highest level to the lowest level by adopting a traversal method so as to perform upper-level and lower-level association on the network element information of different types of equipment.
Optionally, the method further includes:
and merging the alarm information of the fault root node and the child nodes of the fault root node to carry out associated alarm.
The invention provides a fault diagnosis device, comprising:
the establishing module is used for establishing the topology of the network system;
the acquisition module is used for acquiring alarm information in a network system;
the association module is used for associating the alarm information with the nodes in the topology and determining a fault link;
and the determining module is used for determining a fault root node in the fault link according to a preset fault probability calculation rule.
Optionally, the association module is configured to:
associating the alarm information with the nodes in the topology, and determining the alarm nodes in the topology;
and determining the link formed by the alarm node and the parent node and/or the child node of the alarm node as a fault link.
Optionally, the determining module is configured to:
calculating the fault probability of each node according to the alarm information of each node in the fault link, the fault probability of the child node of each node and the number of the child nodes of each node;
respectively judging whether the fault probability of each node is greater than or equal to the fault threshold corresponding to each node;
if the fault probability of the first node is larger than or equal to the fault threshold corresponding to the first node, adding the first node into a candidate list;
and determining the node with the highest level in the candidate list as the fault root node.
Optionally, the determining module is specifically configured to:
calculating the failure probability of each node in the failed link according to the following formula:
[Formula shown as an image in the original publication (Figure BDA0002169205470000031); not reproduced in this text.]
wherein P is the fault probability of a node; A is 0 or 1, where A is 1 if the node has alarm information and A is 0 if the node has no alarm information; sum is the sum of the fault probabilities of the child nodes of the node; and count is the number of child nodes of the node.
Optionally, the establishing module is configured to:
acquiring network element information of each device in a network system within a preset time period, wherein the network element information of each device comprises a hierarchy of each device in the network system;
performing upper and lower level association on the network element information of different types of equipment;
and generating a visualized topology according to the associated network element information.
Optionally, the establishing module is specifically configured to:
and determining the next-level equipment of each equipment layer by layer from the equipment at the highest level to the lowest level by adopting a traversal method so as to perform upper-level and lower-level association on the network element information of different types of equipment.
Optionally, the apparatus further comprises:
and the alarm module is used for combining the alarm information of the fault root node and the alarm information of the child nodes of the fault root node to carry out associated alarm.
The present invention provides a server comprising: a memory and a processor; the memory is connected with the processor;
the memory for storing a computer program;
the processor is configured to implement the fault diagnosis method as described above when the computer program is executed.
The present invention provides a storage medium having stored thereon a computer program which, when executed by a processor, implements the fault diagnosis method as described above.
The invention provides a fault diagnosis method, a fault diagnosis device and a server. The method comprises: establishing the topology of a network system; acquiring alarm information in the network system; associating the alarm information with the nodes in the topology to determine a fault link; and determining a fault root node in the fault link according to a preset fault probability calculation rule. By associating the alarm information with the topology of the network system to determine the fault link, the method quickly obtains the scope and degree of the service impact caused by the alarm. By determining the fault root node in the fault link according to the preset rule, it locates the fault automatically and accurately and improves the efficiency of fault handling.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, a brief description will be given below of the drawings required for the embodiments or the technical solutions in the prior art, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
Fig. 1 is a first schematic flow chart of a fault diagnosis method provided by the present invention;
Fig. 2 is a second schematic flow chart of a fault diagnosis method provided by the present invention;
Fig. 3 is a schematic topology diagram of a network system provided by the present invention;
Fig. 4 is a third schematic flow chart of a fault diagnosis method provided by the present invention;
Fig. 5 is a schematic structural diagram of a fault diagnosis device provided by the present invention;
Fig. 6 is a schematic structural diagram of a server provided by the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
When network equipment fails, alarm information is generated, and maintenance personnel can check the alarm information to find and handle faults in the system. In existing alarm processing systems, each alarm generated on a device is usually sent out as an alarm message for that device; if a device raises alarms frequently for the same cause, the repeated alarms are merged and sent out as a single alarm message for that device.
In a network system with a complex structure and a large number of devices, the alarm information sent by the individual devices is not associated with each other, so the operating condition of the other devices associated with a device cannot be determined from that device's alarm information alone. When locating a fault, maintenance personnel therefore often have to check a large amount of alarm information from many devices to determine the root cause, which is inefficient. To solve these problems, the present invention provides a fault diagnosis method.
Fig. 1 is a first schematic flow chart of a fault diagnosis method provided by the present invention. The execution subject of the method is a fault diagnosis device, which can be implemented by software and/or hardware, for example, the device can be a server. As shown in fig. 1, the method includes:
S101, establishing the topology of the network system.
The topology of the network system is presented in a multi-branch tree structure for visually showing the connection relationship between the devices in the network system. The network element information of each device in the network system includes the type of the device, the level of the device in the network system, the association relationship between the device and other devices, and the like, and according to the acquired network element information of each device, the devices with the same type and the same level can be placed in the same layer, and the devices are associated layer by layer to form a topology of a multi-branch tree structure. In practical application, because the structure of the network system changes dynamically, the network element information of the device can be acquired according to a certain time slice to establish the topology, and the time slice can be set according to actual needs.
S102, acquiring alarm information in the network system.
S103, associating the alarm information with the nodes in the topology, and determining a fault link.
The network system may have an alarm acquisition system that collects the fault alarm information generated on each device, and the fault diagnosis apparatus may obtain the alarm information for a period of time from the alarm acquisition system. The alarm information is then associated with the nodes in the topology to determine the alarm nodes in the topology, and the link formed by an alarm node and the parent node and/or the child nodes of the alarm node is determined as a fault link. Within a period of time, one or more devices may generate alarms. Once the alarm information is associated with nodes in the topology, the nodes of the alarming devices and the other nodes associated with them together form a fault link, and some node in the fault link may be the root cause node of the fault.
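The association step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the original publication: the `Node` class, the `device_id` field of an alarm, and the function names are hypothetical, and the fault link is read here as the alarm node together with its ancestors and descendants, which is only one possible reading of "parent node and/or child node".

```python
class Node:
    """A topology node; the attribute names are illustrative assumptions."""

    def __init__(self, node_id, parent=None):
        self.node_id = node_id
        self.parent = parent
        self.children = []
        self.alarms = []
        if parent is not None:
            parent.children.append(self)


def associate_alarms(index, alarms):
    """Attach each alarm to the node with the same device ID.

    Alarms whose device ID is not in the topology are ignored.
    Returns the alarm nodes in first-seen order.
    """
    alarm_nodes = []
    for alarm in alarms:
        node = index.get(alarm["device_id"])
        if node is None:
            continue
        if not node.alarms:
            alarm_nodes.append(node)
        node.alarms.append(alarm)
    return alarm_nodes


def fault_link(node):
    """Return the ancestors of `node`, the node itself, and its descendants."""
    link = []
    ancestor = node.parent
    while ancestor is not None:
        link.append(ancestor)
        ancestor = ancestor.parent
    link.reverse()  # highest level first
    stack = [node]
    while stack:
        current = stack.pop()
        link.append(current)
        stack.extend(reversed(current.children))
    return link


# Hypothetical topology: SWITCH1 -> OLT1, OLT2; OLT1 -> MDU1; MDU1 -> CAMERA1
sw1 = Node("SWITCH1")
olt1 = Node("OLT1", sw1)
olt2 = Node("OLT2", sw1)
mdu1 = Node("MDU1", olt1)
cam1 = Node("CAMERA1", mdu1)
index = {n.node_id: n for n in (sw1, olt1, olt2, mdu1, cam1)}

alarms = [{"device_id": "MDU1"}, {"device_id": "NOT_IN_TOPOLOGY"}]
alarm_nodes = associate_alarms(index, alarms)
link_ids = [n.node_id for n in fault_link(alarm_nodes[0])]
print(link_ids)
```

A stricter or broader definition of the fault link (for example, the whole subtree under the top-level ancestor) would only change `fault_link`.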
S104, determining a fault root node in the fault link according to a preset fault probability calculation rule.
The fault link includes the node that generated the alarm and the parent and/or child nodes associated with the alarm node, and one of these nodes may be the root cause node of the fault. Based on the alarm information obtained for each node, the fault probability of each node can be calculated according to a preset fault probability rule, and the fault root node is determined from the resulting fault probabilities. The fault probability rule can be set according to the actual conditions of the network system.
The fault diagnosis method provided by this embodiment comprises: establishing the topology of a network system; acquiring alarm information in the network system; associating the alarm information with the nodes in the topology to determine a fault link; and determining a fault root node in the fault link according to a preset fault probability calculation rule. By associating the alarm information with the topology of the network system to determine the fault link, the method quickly obtains the scope and degree of the service impact caused by the alarm. By determining the fault root node in the fault link according to the preset rule, it locates the fault automatically and accurately and improves the efficiency of fault handling.
On the basis of the embodiment shown in Fig. 1, the method for establishing the topology of the network system in S101 is described with reference to a specific example. Fig. 2 is a second schematic flow chart of a fault diagnosis method provided by the present invention. As shown in Fig. 2, establishing the topology of the network system in S101 includes:
S201, acquiring network element information of each device in the network system within a preset time period, wherein the network element information of each device comprises the hierarchy of the device in the network system.
The present embodiment may further include a resource acquisition system, also called a resource system, which collects from the physical devices the network element information of each device, such as port, IP, service type, device, and bandwidth rate. The data types of the network element information differ between device types, and the network element information is data in Simple Network Management Protocol (SNMP) format. The fault diagnosis device obtains from the resource acquisition system the network element information of each device within a preset time period. In a specific implementation, the network element information obtained from the resource acquisition system is placed in a cache by time slice, and the cached data is parsed and checked against the preset rule for device network element information. Information that conforms to the rule is temporarily stored in a List collection, and the List is passed as a parameter to a classification function. The function first checks whether the List contains any elements; if it does, the data type of each element is determined according to the data format agreed with the resource acquisition system, and the network element information, classified by data type, is encapsulated and stored in a Map.
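The validate-then-classify step described above might look like the following Python sketch. It is illustrative only: the record fields (`id`, `type`, `level`, `parent_id`), the validity rule, and the function names are assumptions, since the data format agreed with the resource acquisition system is not given in the text.

```python
from collections import defaultdict

# Assumed minimal rule: a record is usable only if these fields are present.
REQUIRED_FIELDS = ("id", "type", "level")


def is_valid(record):
    """Stand-in for the preset rule applied to device network element data."""
    return all(record.get(field) not in (None, "") for field in REQUIRED_FIELDS)


def classify_elements(records):
    """Keep the records that pass the rule and group them by device type."""
    valid = [r for r in records if is_valid(r)]  # the temporary list
    if not valid:  # nothing to classify
        return {}
    by_type = defaultdict(list)
    for record in valid:
        by_type[record["type"]].append(record)
    return dict(by_type)


records = [
    {"id": "SW1", "type": "SWITCH", "level": 1},
    {"id": "OLT1", "type": "OLT", "level": 2, "parent_id": "SW1"},
    {"id": "broken", "type": "", "level": 3},  # fails the rule, dropped
]
grouped = classify_elements(records)
print(sorted(grouped))
```

Grouping by type up front means each later association step only has to look at the records of one level.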
For example, a monitoring system contains four types of devices: a switch (SWITCH), an Optical Line Terminal (OLT), a Message Decoder Unit (MDU), and a camera (CAMERA). The SWITCH performs in hardware the filtering, learning and forwarding tasks that a network bridge performs in software. It can also divide the network into branches, segment the network data streams and isolate faults occurring within a branch, which reduces the data traffic on each branch, makes each network more effective and improves the efficiency of the whole network. The OLT is the terminal equipment that connects to the optical fiber trunk. The MDU allows two or more physical links to be established between two switching devices; it can bind all physical connections between the two switching devices into one virtual transmission link, over which the switches exchange data.
The resource acquisition system can collect the network element information of each device in the monitoring system, in which the SWITCH is at the highest level, the OLT is connected below the SWITCH, the MDU is connected below the OLT, and the CAMERA is connected below the MDU. After obtaining the network element information of the four types of devices from the resource acquisition system, the fault diagnosis device classifies and encapsulates it as described above. The following is an example of code for acquiring the network element information data and classifying and encapsulating it:
[Code listing shown as images in the original publication (Figures BDA0002169205470000071 and BDA0002169205470000081); not reproduced in this text.]
S202, performing upper and lower level association on the network element information of the different types of equipment.
S203, generating a visual topology according to the associated network element information.
The network element information of a device includes the level of the device in the network system and the other devices associated with it. A traversal method is used to determine the next-level devices of each device layer by layer, from the highest-level devices down to the lowest level, so that the network element information of different types of devices is associated between upper and lower levels.
The network topology is structured so that the chains of the lower-layer multi-branch tree extend upward, and the multi-branch tree model is obtained by traversing the data and hooking it onto these chains. In a specific implementation, the network element information data in the cache is read into a RootNode instance. When a topology is created, a copy of the RootNode instance is made, and direct traversal and/or parallel traversal is chosen according to the data volume: direct traversal is used when the data volume is small and the data structure is simple, which reduces program overhead; parallel traversal is used when the data volume is large and the data structure is complex, which increases the traversal speed. Time slices and a thread lock are used when traversing the network element information acquired in a time slice: the thread lock is checked first, and if it is held, the update is skipped, the multi-branch tree structure is temporarily left unchanged, and the data is processed as a stream. The data classified and encapsulated by type is passed as parameters to the corresponding hooking functions, which associate the network element information data of the different types and restore the nodes of the multi-branch tree structure; once this is complete, the visual topology is generated.
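A minimal Python sketch of the layer-by-layer hooking and the lock check is given below. It is not the code of the original publication (which appears only as images); the `Node` class, the `parent_id` field, the fixed level order, and the function names are assumptions made for illustration.

```python
import threading

LEVELS = ["SWITCH", "OLT", "MDU", "CAMERA"]  # highest level first (assumed)


class Node:
    def __init__(self, node_id, node_type, parent=None):
        self.node_id = node_id
        self.node_type = node_type
        self.parent = parent
        self.children = []


def hook(parents, child_records, child_type):
    """Attach each child record to its parent node; return new nodes by ID."""
    children = {}
    for record in child_records:
        parent = parents.get(record["parent_id"])
        if parent is None:
            continue  # parent not present in this time slice
        node = Node(record["id"], child_type, parent)
        parent.children.append(node)
        children[node.node_id] = node
    return children


def build_topology(by_type):
    """Build the multi-branch tree from the highest level down to the lowest."""
    roots = {r["id"]: Node(r["id"], LEVELS[0]) for r in by_type.get(LEVELS[0], [])}
    current = roots
    for child_type in LEVELS[1:]:
        current = hook(current, by_type.get(child_type, []), child_type)
    return list(roots.values())


def refresh_topology(lock, by_type, previous_roots):
    """Rebuild the tree unless the lock is held, in which case skip the update."""
    if not lock.acquire(blocking=False):
        return previous_roots  # keep the existing tree for now
    try:
        return build_topology(by_type)
    finally:
        lock.release()


by_type = {
    "SWITCH": [{"id": "SW1"}],
    "OLT": [{"id": "OLT1", "parent_id": "SW1"}],
    "MDU": [{"id": "MDU1", "parent_id": "OLT1"}],
    "CAMERA": [{"id": "CAM1", "parent_id": "MDU1"}],
}
lock = threading.Lock()
roots = refresh_topology(lock, by_type, [])
print(roots[0].children[0].children[0].children[0].node_id)
```

The non-blocking `acquire` mirrors the "check the lock, skip if held" behaviour: a refresh that cannot take the lock simply returns the previous tree.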
For example, the monitoring system has 18000 CAMERAs, so parallel traversal is used when matching MDUs with CAMERAs; there are more than 100 MDUs, so parallel traversal is used when matching MDUs with OLTs; when matching OLTs with SWITCHes, direct traversal is used because the installation scale is small. Throughout the traversal of the tree structure, the traversal method is chosen to suit the data volume and program performance, and the hooking of SWITCHes → OLTs → MDUs → CAMERAs is finally completed. After the traversal and hooking are complete, a visual topology graph is obtained. Fig. 3 is a schematic topology diagram of a network system provided by the present invention. Fig. 3 shows only 2 switches and their child nodes; the structure of the topology obtained in practice is determined by the actual devices in the network system. An example of code for restoring a topology is as follows:
[Code listing shown as images in the original publication (Figures BDA0002169205470000091 and BDA0002169205470000101); not reproduced in this text.]
According to the fault diagnosis method provided by this embodiment, the network element information within the preset time period is obtained from the resource acquisition system, and the network element information of the different types of devices is hooked together by traversal, so that the topology of the network system within the preset time period can be generated accurately.
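The choice between direct and parallel traversal by data volume could be sketched as follows. This is a hedged Python illustration, not the method of the original publication: the cut-off `PARALLEL_THRESHOLD`, the worker count, the record fields, and the function names are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

PARALLEL_THRESHOLD = 100  # assumed cut-off; the text gives no exact value


def group_chunk(records):
    """Direct traversal: map each parent ID to the IDs of its children."""
    grouped = {}
    for record in records:
        grouped.setdefault(record["parent_id"], []).append(record["id"])
    return grouped


def match_children(records, workers=4):
    """Use direct traversal for small inputs and chunked parallel traversal otherwise."""
    if len(records) <= PARALLEL_THRESHOLD:
        return group_chunk(records)
    size = -(-len(records) // workers)  # ceiling division
    chunks = [records[i:i + size] for i in range(0, len(records), size)]
    merged = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in chunk order, so child order is preserved
        for partial in pool.map(group_chunk, chunks):
            for parent_id, ids in partial.items():
                merged.setdefault(parent_id, []).extend(ids)
    return merged


# 18000 cameras spread over 150 hypothetical MDUs
cameras = [{"id": f"CAM{i}", "parent_id": f"MDU{i % 150}"} for i in range(18000)]
by_mdu = match_children(cameras)
print(len(by_mdu))
```

In CPython, threads mainly pay off when the matching involves I/O, such as reading from the cache; the sketch shows only the selection logic and the merge of partial results.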
Based on the embodiment shown in fig. 1, the determination of the failure root node in the failed link according to the preset failure probability calculation rule in S104 is further described. Fig. 4 is a third schematic flow chart of a fault diagnosis method provided by the present invention. As shown in fig. 4, determining a failure root node in a failed link according to a preset failure probability calculation rule in S104 includes:
S401, calculating the fault probability of each node according to the alarm information of each node in the fault link, the fault probability of the child nodes of each node and the number of the child nodes of each node.
The fault probability of each node in the topology is related to not only the alarm information of the node itself, but also the child nodes of the node, and it can be understood that if a plurality of child nodes of the node all generate alarm information, the probability that the node generates a fault is also high. The failure probability of each node in the failed link can be specifically calculated by the following formula:
[Formula shown as an image in the original publication (Figure BDA0002169205470000102); not reproduced in this text.]
wherein P is the fault probability of a node; A is 0 or 1, where A is 1 if the node has alarm information and A is 0 if the node has no alarm information; sum is the sum of the fault probabilities of the child nodes of the node; and count is the number of child nodes of the node.
When calculating the fault probability of each node in a fault link, the calculation starts from the nodes at the lowest level. Once the fault probability of a node has been determined it is reported to its parent node, and the fault probability of the parent node is determined from the parent's own alarm information, the fault probabilities of its child nodes and the number of its child nodes, until all nodes have been calculated.
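The bottom-up calculation can be sketched in Python as shown below. The formula itself is only available as an image, so the rule used here, P = 100 × A + sum / count with the second term taken as 0 for a node without children, is an assumption inferred from the variable definitions and the worked figures in the text; the `Node` class and function name are likewise hypothetical.

```python
from fractions import Fraction


class Node:
    def __init__(self, name, has_alarm=False, children=()):
        self.name = name
        self.has_alarm = has_alarm
        self.children = list(children)


def failure_probabilities(node, result=None):
    """Post-order walk: children are evaluated before their parent.

    ASSUMED rule (not confirmed by the source): P = 100 * A + sum / count.
    """
    if result is None:
        result = {}
    total = Fraction(0)
    for child in node.children:
        failure_probabilities(child, result)
        total += result[child.name]  # child reports its probability upward
    a = 1 if node.has_alarm else 0
    average = total / len(node.children) if node.children else Fraction(0)
    result[node.name] = 100 * a + average
    return result


# Small hypothetical tree: top -> parent -> (leaf_a with alarm, leaf_b)
leaf_a = Node("leaf_a", has_alarm=True)
leaf_b = Node("leaf_b")
parent = Node("parent", has_alarm=True, children=[leaf_a, leaf_b])
top = Node("top", children=[parent])

probabilities = failure_probabilities(top)
print(probabilities["parent"], probabilities["top"])
```

Exact fractions are used so that values such as 140/3 are not rounded before being compared with a threshold.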
For example, assume that in the topology shown in Fig. 3, alarm information is generated on each of OLT1, MDU1 and MDU2, and no alarm information is generated by the other nodes. The links formed by SWITCH1 and its child nodes are fault links. The fault probability of the CAMERAs at the lowest level of the links is 0, the fault probability of MDU1 and MDU2 is 100, and the fault probability of the other MDUs is 0. The fault probability of OLT1 is given by an expression shown as an image in the original publication (Figure BDA0002169205470000111; not reproduced in this text), the fault probability of OLT2 is 0, and the fault probability of SWITCH1 is 140/3.
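Because the formula images are not reproduced in this text, the following is only a reconstruction that agrees with the figures quoted in the example; it should be checked against the original drawings. It assumes the rule P = 100·A + sum/count (with the second term equal to 0 when all children have probability 0), and it further assumes that OLT1 has five MDU children and SWITCH1 has three OLT children in Fig. 3. Neither count is stated in the text.

```latex
\begin{align*}
P &= 100\,A + \frac{\mathrm{sum}}{\mathrm{count}} \\
P_{\mathrm{MDU1}} = P_{\mathrm{MDU2}} &= 100 \cdot 1 + 0 = 100 \\
P_{\mathrm{OLT1}} &= 100 \cdot 1 + \frac{100 + 100 + 0 + 0 + 0}{5} = 140 \\
P_{\mathrm{SWITCH1}} &= 100 \cdot 0 + \frac{140 + 0 + 0}{3} = \frac{140}{3}
\end{align*}
```

Under these assumptions the value 140/3 quoted for SWITCH1 is reproduced, and the values of 100 for MDU1 and MDU2 match the text.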
S402, respectively judging whether the fault probability of each node is larger than or equal to the fault threshold corresponding to each node.
S403, if the fault probability of the first node is greater than or equal to the fault threshold corresponding to the first node, adding the first node into a candidate list.
S404, determining the node with the highest level in the candidate list as the fault root node.
Nodes at different levels of the topology, that is, different types of devices, have different fault thresholds. For each node in the fault link, it is determined whether its fault probability is greater than or equal to the fault threshold of the level where the node is located. If the fault probability of a first node is greater than or equal to the fault threshold of its level, that is, the first node is likely to be faulty, the first node is added to a candidate list; the nodes in the candidate list are the candidates for the fault root node. The node with the highest level in the candidate list is then determined to be the fault root node, which can also be called the maximum common point.
Assuming that the failure threshold of the SWITCH is 70, the failure threshold of the OLT is 120, and the failure threshold of the MDU is 100, the candidate list in the above example includes MDU1, MDU2, and OLT1, where OLT1 is the highest node in the hierarchy, and thus OLT1 is determined as the failure root node.
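The threshold comparison and root selection for this example can be sketched in Python as follows. The thresholds and the values for SWITCH1, MDU1, MDU2 and OLT2 come from the text; the value 140 for OLT1 is an assumption (the text only implies it is at least 120), as are the extra node MDU3, the level numbering, the function name, and the choice to skip device types that have no stated threshold.

```python
from fractions import Fraction

THRESHOLDS = {"SWITCH": 70, "OLT": 120, "MDU": 100}  # from the example
LEVEL = {"SWITCH": 1, "OLT": 2, "MDU": 3, "CAMERA": 4}  # 1 = highest (assumed)


def select_root(nodes):
    """nodes: iterable of (name, device_type, fault_probability).

    Returns (root_name, candidate_names). Types without a threshold are
    skipped, which is an assumption of this sketch.
    """
    candidates = [
        (name, kind, p)
        for name, kind, p in nodes
        if kind in THRESHOLDS and p >= THRESHOLDS[kind]
    ]
    if not candidates:
        return None, []
    # Highest level wins; on a tie, the first candidate listed is returned.
    root = min(candidates, key=lambda item: LEVEL[item[1]])
    return root[0], [name for name, _, _ in candidates]


nodes = [
    ("SWITCH1", "SWITCH", Fraction(140, 3)),
    ("OLT1", "OLT", 140),  # assumed value, see lead-in
    ("OLT2", "OLT", 0),
    ("MDU1", "MDU", 100),
    ("MDU2", "MDU", 100),
    ("MDU3", "MDU", 0),
    ("CAMERA1", "CAMERA", 0),
]
root, candidates = select_root(nodes)
print(root, candidates)
```

SWITCH1 stays out of the candidate list because 140/3 is below its threshold of 70, so the highest-level candidate is OLT1.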
According to the fault diagnosis method provided by the embodiment, the fault probability of each node is calculated according to the alarm information of each node in the fault link, the fault probability of the child node of each node and the number of the child nodes of each node, the fault probability of each node is compared and judged, and the fault root node is determined, so that the root cause of the fault is quickly positioned in a complex network system, and the fault maintenance efficiency is improved.
The following is an example of a code for hooking alarm information to topology and calculating fault probability to determine the maximum common point:
1. the alarm information is matched with the topology hanging connection, and whether the equipment ID receiving the alarm is on the topology is judged.
Figure BDA0002169205470000112
Figure BDA0002169205470000121
2. Calculating fault probability and reporting buffer warning node
Figure BDA0002169205470000122
Figure BDA0002169205470000131
3. Traverse the candidate list; if the parent node of the current node has no alarm, the current node is the root.
Figure BDA0002169205470000132
Figure BDA0002169205470000141
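A minimal Python sketch of this traversal is given below, since the original listing is available only as an image. The names `find_roots`, `parent_of` and `alarmed` are assumptions for illustration.

```python
def find_roots(candidates, parent_of, alarmed):
    """Return the candidates whose parent node carries no alarm."""
    roots = []
    for node in candidates:
        parent = parent_of.get(node)
        # A node with no parent, or whose parent has no alarm, is treated as a root.
        if parent is None or parent not in alarmed:
            roots.append(node)
    return roots


# Hypothetical data following the OLT/MDU example.
parent_of = {"SWITCH1": None, "OLT1": "SWITCH1", "MDU1": "OLT1", "MDU2": "OLT1"}
alarmed = {"OLT1", "MDU1", "MDU2"}
roots = find_roots(["MDU1", "MDU2", "OLT1"], parent_of, alarmed)
print(roots)
```

Here MDU1 and MDU2 are skipped because their parent OLT1 has an alarm, while OLT1 is kept because SWITCH1 has none.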
On the basis of the above embodiment, the fault diagnosis method further includes: merging the alarm information of the fault root node and of its child nodes to raise a correlated alarm. For example, the alarm information may include: board software not running normally, failure of a cluster stack member, the delay difference of Mp-group member links exceeding a threshold, loss of signal (LOS) on an Ethernet physical interface (ETPI), system power failure, and the like. Parent alarms and child alarms are associated, so that the alarm hierarchy can be expanded and presented layer by layer, and alarm associations across devices can be displayed together.
The fault root node determined in the above embodiment is OLT1, and its child nodes MDU1 and MDU2 also generate alarm information. Raising a correlated alarm on the alarm information of OLT1, MDU1 and MDU2 shows the root cause and the affected range of the fault more intuitively, so that maintenance personnel can repair the fault in time. The correlated alarm also reduces the number of alarms in the network system, which makes it easier for maintenance personnel to browse alarm information and improves maintenance efficiency.
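One way to picture the correlated alarm is as a single parent record with the child alarms nested under it. The Python sketch below assumes this record layout and the name `correlate`; neither is specified by the patent.

```python
def correlate(root, children, alarms_by_node):
    """Merge the alarms of the root node and its child nodes into one record."""
    child_alarms = [
        {"node": child, "alarm": text}
        for child in children.get(root, [])
        for text in alarms_by_node.get(child, [])
    ]
    return {
        "node": root,
        "alarms": alarms_by_node.get(root, []),
        "children": child_alarms,
    }


# Hypothetical alarms for the OLT1 / MDU1 / MDU2 example.
children = {"OLT1": ["MDU1", "MDU2"]}
alarms_by_node = {
    "OLT1": ["system power failure"],
    "MDU1": ["loss of signal"],
    "MDU2": ["loss of signal"],
}
merged = correlate("OLT1", children, alarms_by_node)
print(merged)
```

Three separate alarms are presented as one parent entry, which can be expanded to show the two child alarms.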
Fig. 5 is a schematic structural diagram of a fault diagnosis device provided by the present invention. As shown in fig. 5, the failure diagnosis apparatus 50 includes:
an establishing module 501, configured to establish a topology of a network system;
an obtaining module 502, configured to obtain alarm information in a network system;
an association module 503, configured to associate the alarm information with nodes in the topology and determine a faulty link;
a determining module 504, configured to determine a failure root node in a failed link according to a preset failure probability calculation rule.
Optionally, the association module 503 is configured to:
associate the alarm information with nodes in the topology and determine the alarm nodes in the topology;
and determine the links formed by each alarm node and its parent node and/or child nodes as faulty links.
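The link determination described above might be sketched in Python as follows, representing each link as a (parent, child) pair. This representation and the name `failed_links` are assumptions for illustration.

```python
def failed_links(alarm_nodes, parent_of):
    """Collect the links between each alarm node and its parent and children."""
    # Derive the child lists from the parent mapping.
    children = {}
    for node, parent in parent_of.items():
        if parent is not None:
            children.setdefault(parent, []).append(node)

    links = set()
    for node in alarm_nodes:
        parent = parent_of.get(node)
        if parent is not None:
            links.add((parent, node))
        for child in children.get(node, []):
            links.add((node, child))
    return links


# Hypothetical topology: OLT2 has no alarm and is not adjacent to an alarm node's link.
parent_of = {
    "SWITCH1": None,
    "OLT1": "SWITCH1",
    "OLT2": "SWITCH1",
    "MDU1": "OLT1",
    "MDU2": "OLT1",
}
links = failed_links({"OLT1", "MDU1"}, parent_of)
print(sorted(links))
```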
Optionally, the determining module 504 is configured to:
calculating the fault probability of each node according to the alarm information of each node in the fault link, the fault probability of the child node of each node and the number of the child nodes of each node;
respectively judging whether the fault probability of each node is greater than or equal to the fault threshold corresponding to each node;
if the fault probability of the first node is larger than or equal to the fault threshold corresponding to the first node, adding the first node into a candidate list;
and determining the node with the highest level in the candidate list as the fault root node.
Optionally, the determining module 504 is specifically configured to:
calculating the failure probability of each node in the failed link according to the following formula:
Figure BDA0002169205470000151
wherein P is the failure probability of the node; A is 0 or 1, with A = 1 if the node has alarm information and A = 0 if it does not; sum is the sum of the failure probabilities of the child nodes of the node; and count is the number of child nodes of the node.
Optionally, the establishing module 501 is configured to:
acquiring network element information of each device in a network system within a preset time period, wherein the network element information of each device comprises a hierarchy of each device in the network system;
performing upper and lower level association on the network element information of different types of equipment;
and generating a visualized topology according to the associated network element information.
Optionally, the establishing module 501 is specifically configured to:
and determining the next-level equipment of each equipment layer by layer from the equipment at the highest level to the lowest level by adopting a traversal method so as to perform upper-level and lower-level association on the network element information of different types of equipment.
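The layer-by-layer traversal can be sketched in Python as a breadth-first walk from the highest-level devices downward. The network element fields `id`, `level` and `uplink` (the ID of the upper-level device) are hypothetical; the patent states only that the network element information includes each device's hierarchy.

```python
from collections import deque


def build_topology(elements):
    """Return a dict mapping each device ID to the IDs of its next-level devices."""
    # Group network elements by the device they are attached to.
    by_uplink = {}
    for e in elements:
        by_uplink.setdefault(e["uplink"], []).append(e)

    children = {}
    # Start from the devices with no upper-level device and walk downward.
    queue = deque(sorted(by_uplink.get(None, []), key=lambda e: e["level"]))
    while queue:
        current = queue.popleft()
        lower = by_uplink.get(current["id"], [])
        children[current["id"]] = [e["id"] for e in lower]
        queue.extend(lower)
    return children


# Hypothetical network element information.
elements = [
    {"id": "SWITCH1", "level": 0, "uplink": None},
    {"id": "OLT1", "level": 1, "uplink": "SWITCH1"},
    {"id": "MDU1", "level": 2, "uplink": "OLT1"},
    {"id": "MDU2", "level": 2, "uplink": "OLT1"},
]
tree = build_topology(elements)
print(tree)
```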
Optionally, the apparatus 50 further comprises:
an alarm module 505, configured to merge the alarm information of the fault root node and of its child nodes to raise a correlated alarm.
The apparatus of this embodiment may be used to execute the technical solutions of the method embodiments shown in fig. 1, fig. 2, or fig. 4, and the implementation principles and technical effects thereof are similar and will not be described herein again.
Fig. 6 is a schematic structural diagram of a server according to the present invention. As shown in fig. 6, the server 60 includes: a memory 601 and a processor 602; the memory 601 is connected to the processor 602.
A memory 601 for storing a computer program;
a processor 602 for implementing the fault diagnosis method as described above when the computer program is executed.
The present invention provides a storage medium having stored thereon a computer program which, when executed by a processor, implements the fault diagnosis method as described above.
Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and these modifications or substitutions do not depart from the spirit of the corresponding technical solutions of the embodiments of the present invention.

Claims (7)

1. A fault diagnosis method, comprising:
establishing the topology of a network system;
acquiring alarm information in a network system;
associating the alarm information with the nodes in the topology to determine the alarm nodes in the topology;
determining links formed by the alarm nodes and father nodes and/or child nodes of the alarm nodes as fault links;
calculating the fault probability of each node according to the alarm information of each node in the fault link, the fault probability of the child node of each node and the number of the child nodes of each node;
respectively judging whether the fault probability of each node is greater than or equal to a fault threshold corresponding to each node;
if the fault probability of a first node is larger than or equal to a fault threshold corresponding to the first node, adding the first node into a candidate list;
determining a node with the highest level in the candidate list as a fault root node;
wherein, the calculating the fault probability of each node according to the alarm information of each node in the fault link, the fault probability of the child node of each node and the number of the child nodes comprises:
calculating the failure probability of each node in the failed link according to the following formula:
Figure FDA0003643749610000011
wherein P is the fault probability of the node; A is 0 or 1, with A = 1 if the node has alarm information and A = 0 if it does not; sum is the sum of the fault probabilities of the child nodes of the node; and count is the number of child nodes of the node.
2. The method of claim 1, wherein establishing the topology of the network system comprises:
acquiring network element information of each device in a network system within a preset time period, wherein the network element information of each device comprises a hierarchy of each device in the network system;
performing upper and lower level association on the network element information of different types of equipment;
and generating a visualized topology according to the associated network element information.
3. The method of claim 2, wherein the performing the upper and lower level association on the network element information of the different types of devices comprises:
and determining the next-level equipment of each equipment layer by layer from the equipment at the highest level to the lowest level by adopting a traversal method so as to associate the network element information of different types of equipment at the upper level and the lower level.
4. The method according to any one of claims 1-3, further comprising:
and merging the alarm information of the fault root node and the child nodes of the fault root node to carry out associated alarm.
5. A fault diagnosis device, comprising:
the establishing module is used for establishing the topology of the network system;
the acquisition module is used for acquiring alarm information in a network system;
the correlation module is used for correlating the alarm information with nodes in the topology to determine a fault link;
the determining module is used for determining a fault root node in a fault link according to a preset fault probability calculation rule;
the determining module is specifically configured to: calculating the fault probability of each node according to the alarm information of each node in the fault link, the fault probability of the child node of each node and the number of the child nodes of each node;
respectively judging whether the fault probability of each node is greater than or equal to a fault threshold corresponding to each node;
if the fault probability of the first node is larger than or equal to the fault threshold corresponding to the first node, adding the first node into a candidate list;
determining a node with the highest level in the candidate list as the fault root node;
the association module is specifically configured to:
the alarm information is associated with the nodes in the topology, and the alarm nodes in the topology are determined;
determining a link formed by the alarm node and a father node and/or a child node of the alarm node as a fault link;
the determining module is specifically configured to:
calculating the failure probability of each node in the failed link according to the following formula:
Figure FDA0003643749610000021
wherein P is the failure probability of the node; A is 0 or 1, with A = 1 if the node has alarm information and A = 0 if it does not; sum is the sum of the failure probabilities of the child nodes of the node; and count is the number of child nodes of the node.
6. A server, comprising: a memory and a processor; the memory is connected with the processor;
the memory for storing a computer program;
the processor is configured to implement the fault diagnosis method of any one of claims 1-4 when executing the computer program.
7. A storage medium having stored thereon a computer program for implementing the method of fault diagnosis according to any one of claims 1-4 when executed by a processor.
CN201910757406.7A 2019-08-16 2019-08-16 Fault diagnosis method and device and server Active CN110493042B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910757406.7A CN110493042B (en) 2019-08-16 2019-08-16 Fault diagnosis method and device and server

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910757406.7A CN110493042B (en) 2019-08-16 2019-08-16 Fault diagnosis method and device and server

Publications (2)

Publication Number Publication Date
CN110493042A CN110493042A (en) 2019-11-22
CN110493042B true CN110493042B (en) 2022-09-13

Family

ID=68551384

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910757406.7A Active CN110493042B (en) 2019-08-16 2019-08-16 Fault diagnosis method and device and server

Country Status (1)

Country Link
CN (1) CN110493042B (en)

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112953738B (en) * 2019-11-26 2022-06-10 中国移动通信集团山东有限公司 Root cause alarm positioning system, method and device and computer equipment
CN110995482B (en) * 2019-11-27 2022-06-21 深圳市商汤科技有限公司 Alarm analysis method and device, computer equipment and computer readable storage medium
CN111082994A (en) * 2019-12-25 2020-04-28 北京同有飞骥科技股份有限公司 Distributed resource state rapid tracking method and system
CN111107158B (en) * 2019-12-26 2023-02-17 远景智能国际私人投资有限公司 Alarm method, device, equipment and medium for Internet of things equipment cluster
CN111342997B (en) * 2020-02-06 2022-08-09 烽火通信科技股份有限公司 Construction method of deep neural network model, fault diagnosis method and system
CN113347654B (en) * 2020-03-03 2023-04-07 中国移动通信集团贵州有限公司 Method and device for determining fault type of out-of-service base station
CN113395108B (en) * 2020-03-12 2022-12-27 华为技术有限公司 Fault processing method, device and system
CN111722952A (en) * 2020-05-25 2020-09-29 中国建设银行股份有限公司 Fault analysis method, system, equipment and storage medium of business system
CN111858123B (en) * 2020-07-29 2023-09-26 中国工商银行股份有限公司 Fault root cause analysis method and device based on directed graph network
CN114095335B (en) * 2020-08-03 2023-11-03 中国移动通信集团山东有限公司 Network alarm processing method and device and electronic equipment
CN114285730A (en) * 2020-09-18 2022-04-05 华为技术有限公司 Method and device for determining fault root cause and related equipment
CN112468400A (en) * 2020-11-09 2021-03-09 青岛海信网络科技股份有限公司 Fault positioning method, device, equipment and medium
CN114500244A (en) * 2020-11-13 2022-05-13 中兴通讯股份有限公司 Network fault diagnosis method and device, computer equipment and readable medium
CN112583644B (en) * 2020-12-14 2022-10-18 华为技术有限公司 Alarm processing method, device, equipment and readable storage medium
CN112543126A (en) * 2020-12-22 2021-03-23 武汉联影医疗科技有限公司 Cloud platform monitoring method and device, computer equipment and storage medium
CN115086154A (en) * 2021-03-11 2022-09-20 中国电信股份有限公司 Fault delimitation method and device, storage medium and electronic equipment
CN112988525B (en) * 2021-03-22 2022-07-22 新华三技术有限公司 Method and device for matching alarm association rules
CN113037570B (en) * 2021-04-29 2022-12-13 中国联合网络通信集团有限公司 Alarm processing method and equipment
CN115542067A (en) * 2021-06-30 2022-12-30 华为技术有限公司 Fault detection method and device
US20230239206A1 (en) * 2022-01-24 2023-07-27 Rakuten Mobile, Inc. Topology Alarm Correlation
CN115442255B (en) * 2022-03-11 2024-02-06 北京罗克维尔斯科技有限公司 Ethernet detection method, system, device, electronic equipment and storage medium
CN114710532B (en) * 2022-04-02 2023-10-03 中国科学院水生生物研究所 Method and device for suppressing security electricity utilization alarm of museum
CN114710396B (en) * 2022-04-08 2023-06-23 中国联合网络通信集团有限公司 Network alarm processing method and server
CN115086143A (en) * 2022-04-28 2022-09-20 阿里巴巴(中国)有限公司 Fault early warning method and device
CN115102844A (en) * 2022-06-09 2022-09-23 摩拜(北京)信息技术有限公司 Fault monitoring and processing method and device and electronic equipment
CN116017516B (en) * 2023-03-24 2023-06-27 广州世炬网络科技有限公司 Node connection configuration method and device based on link interference

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107679716A (en) * 2017-09-19 2018-02-09 西南交通大学 Consider the risk assessment of interconnected network cascading failure and the alarm method of communication fragile degree
CN108521346A (en) * 2018-04-07 2018-09-11 中南大学 Method for positioning abnormal nodes of telecommunication bearer network based on terminal data

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101707537B (en) * 2009-11-18 2012-01-25 华为技术有限公司 Positioning method of failed link and alarm root cause analyzing method, equipment and system
CN104796273B (en) * 2014-01-20 2018-11-16 中国移动通信集团山西有限公司 A kind of method and apparatus of network fault root diagnosis
CN106603317A (en) * 2017-02-20 2017-04-26 山东浪潮商用系统有限公司 Alarm monitoring strategy analysis method based on data mining technology
CN107633307B (en) * 2017-09-08 2021-08-31 国家计算机网络与信息安全管理中心 Power supply and distribution system root alarm detection method, device, terminal and computer storage medium
CN108494591A (en) * 2018-03-16 2018-09-04 北京京东金融科技控股有限公司 system alarm processing method and device

Also Published As

Publication number Publication date
CN110493042A (en) 2019-11-22

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant