CN112787841B - Fault root cause positioning method and device and computer storage medium - Google Patents

Fault root cause positioning method and device and computer storage medium

Info

Publication number
CN112787841B
Authority
CN
China
Prior art keywords
abnormal
network entity
target
sub
fault
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911096747.0A
Other languages
Chinese (zh)
Other versions
CN112787841A (en
Inventor
高云鹏
谢于明
肖欣
王仲宇
尘福兴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN201911096747.0A
Publication of CN112787841A
Application granted
Publication of CN112787841B
Legal status: Active

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06Management of faults, events, alarms or notifications
    • H04L41/0677Localisation of faults
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • G06F16/9024Graphs; Linked lists
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/12Discovery or management of network topologies

Abstract

The application discloses a fault root cause locating method and device and a computer storage medium, and belongs to the field of network technologies. A management device acquires a first knowledge graph of a faulty target network, on which the abnormal network entities that generated abnormal events in the target network are identified. The management device generates n abnormal sub-maps based on the first knowledge graph. Each abnormal sub-map includes one or more abnormal network entities; when an abnormal sub-map includes a plurality of abnormal network entities, each abnormal network entity in the sub-map satisfies a fault propagation condition with one or more other abnormal network entities in the same sub-map. The n abnormal sub-maps together include all abnormal network entities on the first knowledge graph, and each abnormal network entity belongs to exactly one abnormal sub-map. The management device determines a root cause fault network entity in each of one or more of the abnormal sub-maps. The method and the device improve the accuracy of fault root cause locating in the network.

Description

Fault root cause positioning method and device and computer storage medium
Technical Field
The present application relates to the field of network technologies, and in particular, to a method and an apparatus for locating a fault root cause, and a computer storage medium.
Background
The causes of faults in current networks are complex. For example, in a data center network (DCN), a network fault may be caused by an Address Resolution Protocol (ARP) entry overrun, a device restart, a router identifier (router ID) conflict, and the like. Network troubleshooting is therefore difficult.
It has been proposed to determine the root cause of a fault in a network (hereinafter referred to as the fault root cause) by means of a fault tree. In a rule-based fault tree, one root cause decision rule corresponds to one fault root cause; when the network data acquired in a fault scenario conforms to a root cause decision rule, the fault root cause of that scenario is determined to be the one corresponding to the rule. A root cause decision rule may be a combination of a plurality of single rules through AND gates and OR gates.
However, current fault trees are usually constructed based on the fault propagation rules of a single device, whereas in an actual network faults may propagate between different devices, so a fault tree cannot accurately locate the fault root cause in the network. The accuracy of fault root cause locating based on a fault tree is therefore low.
Disclosure of Invention
The application provides a fault root cause positioning method and device and a computer storage medium, which can solve the problem of low accuracy of fault root cause positioning in the current network.
In a first aspect, a method for locating a fault root cause is provided. The method comprises the following steps:
The management device acquires a first knowledge graph of a faulty target network. The abnormal network entities that generated abnormal events in the target network are identified on the first knowledge graph, and the type of each network entity on the first knowledge graph is network device, interface, protocol, or service. The management device generates n abnormal sub-maps based on the first knowledge graph, where n is a positive integer. Each abnormal sub-map includes one or more abnormal network entities; when an abnormal sub-map includes a plurality of abnormal network entities, each abnormal network entity in the sub-map satisfies the fault propagation condition with one or more other abnormal network entities in the same sub-map. The n abnormal sub-maps together include all abnormal network entities on the first knowledge graph, and each abnormal network entity belongs to exactly one abnormal sub-map. For one or more of the n abnormal sub-maps, the management device determines the root cause fault network entity in the abnormal sub-map. A root cause fault network entity is an abnormal network entity that is a fault root cause.
In this application, the knowledge graph is generated based on the whole network, and the fault propagation conditions corresponding to the knowledge graph are also defined over the whole network. Fault propagation between devices is therefore taken into account when the knowledge graph is used to locate the fault root cause, which improves the accuracy of fault root cause locating in the network. Dividing the knowledge graph into n abnormal sub-maps, in each of which the abnormal network entities satisfy the fault propagation condition, groups the faults in the target network. The management device can then locate the fault root cause separately in each abnormal sub-map, which reduces the scale of the graph to be processed and effectively improves the efficiency of fault root cause locating.
Optionally, the fault propagation condition includes one or more of a fault propagation relation, a fault propagation time condition, and a fault propagation probability condition. A fault propagation relation indicates a path along which a fault propagates in the communication network. Two network entities satisfy the fault propagation relation when they are located on the same fault propagation path. Two abnormal network entities satisfy the fault propagation time condition when the interval between their fault occurrence times is less than a target duration. Two abnormal network entities satisfy the fault propagation probability condition when the fault propagation probability corresponding to the fault propagation relation between them is greater than a target probability threshold.
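The three conditions can be combined into a single predicate. The following is a minimal sketch, not the patented implementation: the function names, the dictionary layout of `relations`, and the choice to treat a relation as matching in either direction are all assumptions made for illustration.

```python
def relation_probability(relations, a, b):
    """Look up the propagation probability between a and b (either direction).

    `relations` is a hypothetical dict mapping (entity, entity) pairs to a
    fault propagation probability. Returns None when no relation exists.
    """
    if (a, b) in relations:
        return relations[(a, b)]
    return relations.get((b, a))


def satisfies_propagation(relations, fault_time, a, b, target_duration, target_prob):
    """Return True when a and b meet the relation, time, and probability conditions."""
    prob = relation_probability(relations, a, b)
    if prob is None:
        return False  # no fault propagation relation between the two entities
    if abs(fault_time[a] - fault_time[b]) >= target_duration:
        return False  # fault occurrence times are too far apart
    return prob > target_prob  # probability must exceed the threshold


relations = {("if1", "bgp1"): 0.9}
fault_time = {"if1": 100.0, "bgp1": 103.0}
print(satisfies_propagation(relations, fault_time, "if1", "bgp1", 5.0, 0.5))
```

Both comparisons are strict, matching the wording "less than the target duration" and "greater than the target probability threshold".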
Optionally, the implementation process of the management device generating n abnormal sub-maps based on the first knowledge-map includes:
The management device acquires an abnormal network entity set, which includes all abnormal network entities on the first knowledge graph. The management device repeatedly executes a sub-map generation process until the abnormal network entity set is empty, thereby obtaining the n abnormal sub-maps. The sub-map generation process includes the following steps:
The management device selects an initial abnormal network entity from the abnormal network entity set. The management device executes a target matching process on the initial abnormal network entity to obtain an abnormal sub-map that includes the initial abnormal network entity. The management device then deletes all abnormal network entities in the abnormal sub-map from the abnormal network entity set to obtain an updated abnormal network entity set.
The target matching process comprises the following steps:
The management device acquires, based on the first knowledge graph, all target nearest-neighbor abnormal network entities of the initial abnormal network entity. No other abnormal network entity exists between a target nearest-neighbor abnormal network entity and the initial abnormal network entity, and a target nearest-neighbor abnormal network entity is not yet in the abnormal sub-map where the initial abnormal network entity is located. For each target nearest-neighbor abnormal network entity:
When the fault propagation condition is satisfied between the target nearest-neighbor abnormal network entity and the initial abnormal network entity, the management device adds the target nearest-neighbor abnormal network entity to the abnormal sub-map where the initial abnormal network entity is located, takes it as a new initial abnormal network entity, and executes the target matching process again. When the fault propagation condition is not satisfied, the management device determines that the target nearest-neighbor abnormal network entity does not belong to the abnormal sub-map where the initial abnormal network entity is located.
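The sub-map generation and target matching processes amount to growing groups of abnormal entities across the knowledge graph. The sketch below is one possible reading under stated assumptions: the graph is an undirected adjacency dict, `condition(a, b)` is a caller-supplied (hypothetical) predicate for the fault propagation condition, recursion is replaced by an explicit stack, and the initial entity is picked with `min` only to make the result deterministic.

```python
from collections import deque


def nearest_abnormal_neighbors(adjacency, start, abnormal):
    """Abnormal entities reachable from start with no other abnormal entity in between."""
    found, seen, queue = set(), {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt in seen:
                continue
            seen.add(nxt)
            if nxt in abnormal:
                found.add(nxt)     # stop: do not search past an abnormal entity
            else:
                queue.append(nxt)  # normal entity: keep searching through it
    return found


def split_into_submaps(adjacency, abnormal, condition):
    """Partition the abnormal entities into abnormal sub-maps."""
    remaining = set(abnormal)
    submaps = []
    while remaining:                      # repeat until the set is empty
        start = min(remaining)            # pick an initial abnormal entity
        submap, stack = {start}, [start]
        while stack:                      # target matching process
            current = stack.pop()
            for nb in sorted(nearest_abnormal_neighbors(adjacency, current, abnormal)):
                if nb in remaining and nb not in submap and condition(current, nb):
                    submap.add(nb)
                    stack.append(nb)      # nb becomes a new initial entity
        remaining -= submap               # delete grouped entities from the set
        submaps.append(submap)
    return submaps


# A - n1 - B - C - D, where n1 is a normal entity.
adjacency = {"A": ["n1"], "n1": ["A", "B"], "B": ["n1", "C"],
             "C": ["B", "D"], "D": ["C"]}
linked = {frozenset(("A", "B")), frozenset(("B", "C"))}
submaps = split_into_submaps(adjacency, {"A", "B", "C", "D"},
                             lambda a, b: frozenset((a, b)) in linked)
print(submaps)
```

Restricting candidates to `remaining` keeps each abnormal entity in exactly one sub-map even if the supplied predicate is not symmetric.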
In one implementation, when the target nearest-neighbor abnormal network entity and the initial abnormal network entity are two adjacent network entities, the fault propagation condition being satisfied between them includes:
A fault propagation relation exists between the target nearest-neighbor abnormal network entity and the initial abnormal network entity, the interval between the fault occurrence time of the target nearest-neighbor abnormal network entity and that of the initial abnormal network entity is less than the target duration, and the fault propagation probability corresponding to the fault propagation relation is greater than the target probability threshold.
In another implementation, when a normal network entity exists between the target nearest-neighbor abnormal network entity and the initial abnormal network entity, the fault propagation condition being satisfied between them includes:
A first fault propagation relation exists between the target nearest-neighbor abnormal network entity and the normal network entity, a second fault propagation relation exists between the initial abnormal network entity and the normal network entity, the interval between the fault occurrence time of the target nearest-neighbor abnormal network entity and that of the initial abnormal network entity is less than the target duration, and the fault propagation probabilities corresponding to the first and second fault propagation relations are both greater than the target probability threshold.
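The case with an intermediate normal network entity can be sketched as follows. This is an illustrative assumption-laden example: the `relations` dict, the helper names, and the entity identifiers are hypothetical, and relations are matched in either direction.

```python
def lookup(relations, a, b):
    """Propagation probability of the relation between a and b, or None if absent."""
    if (a, b) in relations:
        return relations[(a, b)]
    return relations.get((b, a))


def satisfies_via_normal(relations, fault_time, neighbor, initial, normal,
                         target_duration, target_prob):
    """Check the condition when a normal entity sits between two abnormal ones."""
    first = lookup(relations, neighbor, normal)   # first fault propagation relation
    second = lookup(relations, initial, normal)   # second fault propagation relation
    if first is None or second is None:
        return False
    # The time condition compares the two abnormal entities directly.
    if abs(fault_time[neighbor] - fault_time[initial]) >= target_duration:
        return False
    # Both relations must exceed the probability threshold.
    return first > target_prob and second > target_prob


relations = {("if1", "dev1"): 0.8, ("if2", "dev1"): 0.7}
fault_time = {"if1": 10.0, "if2": 12.0}
print(satisfies_via_normal(relations, fault_time, "if1", "if2", "dev1", 5.0, 0.6))
```

The normal entity has no fault occurrence time, so only the two abnormal entities enter the time comparison.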
Optionally, after the management device acquires the first knowledge graph of the failed target network, the management device further acquires m second knowledge graphs of the target network corresponding to m moments, where the m moments correspond to the m second knowledge graphs one-to-one, the m moments are different from the generation moments of the first knowledge graph, and m is a positive integer. And the management equipment completes the network entity connection relation in the first knowledge graph according to the m second knowledge graphs.
In this application, the connection relations missing from the current knowledge graph are determined through sub-map comparison, and the connection relations with high confidence are added to the current knowledge graph. This addresses the problem that connection relations lost from the knowledge graph because of the network fault make the tracing of the fault root cause inaccurate, and further improves the accuracy of fault root cause locating.
Optionally, the implementation process in which the management device completes the network entity connection relations in the first knowledge graph according to the m second knowledge graphs includes:
the management device executes a connection relation completion process for each abnormal network entity in the first knowledge graph. The connection relation completion flow comprises the following steps:
The management device acquires a first sub-map from the first knowledge graph. The first sub-map includes the abnormal network entity and all network entities that have connection relations with it, where the connection relations include direct and/or indirect connection relations. The management device acquires one second sub-map from each second knowledge graph according to the identifier of the abnormal network entity, to obtain m second sub-maps. Each second sub-map includes a target network entity and all network entities that have connection relations with the target network entity, and the identifier of the target network entity is the same as that of the abnormal network entity. The management device acquires a target connection relation based on the first sub-map and the m second sub-maps, where the target connection relation satisfies the following: it is not included in the first sub-map, and it is included in one or more of the m second sub-maps. When the confidence of the target connection relation is greater than a confidence threshold, the management device adds the target connection relation to the first sub-map. The confidence of the target connection relation is positively correlated with the number of times it occurs in the m second sub-maps.
Optionally, the number of occurrences of the target connection relation in the m second sub-maps is c, the confidence of the target connection relation is equal to c/m, and c is a positive integer.
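With confidence defined as c/m, the completion step reduces to counting how often a missing connection relation appears across the m second sub-maps. The sketch below assumes each sub-map is represented simply as a set of `(entity, entity)` tuples; that representation and the function name are hypothetical.

```python
from collections import Counter


def complete_connections(first_edges, second_edge_sets, confidence_threshold):
    """Return first_edges plus missing relations whose confidence c/m exceeds the threshold."""
    m = len(second_edge_sets)
    if m == 0:
        return set(first_edges)
    counts = Counter()
    for edges in second_edge_sets:
        for edge in set(edges) - set(first_edges):
            counts[edge] += 1          # c: occurrences in the m second sub-maps
    added = {edge for edge, c in counts.items() if c / m > confidence_threshold}
    return set(first_edges) | added


first = {("A", "B")}
seconds = [{("A", "B"), ("A", "C")}, {("A", "C")}, {("A", "D")}]
completed = complete_connections(first, seconds, 0.5)
print(sorted(completed))
```

Here the relation A-C appears in 2 of 3 second sub-maps (confidence 2/3) and is added, while A-D appears once (confidence 1/3) and is not.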
Optionally, the implementation process in which the management device determines the root cause fault network entity in the abnormal sub-map includes:
The management device calculates the out-degree of each abnormal network entity in the abnormal sub-map. The management device determines each abnormal network entity whose out-degree is 0 as a root cause fault network entity in the abnormal sub-map.
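The out-degree rule can be sketched in a few lines. The edge orientation is an assumption: each directed pair `(src, dst)` is taken to point from an affected entity toward the entity that affects it, so that a root cause is a tail node with no outgoing edge. The names and identifiers are illustrative only.

```python
def root_cause_entities(submap_nodes, directed_edges):
    """Return the entities of an abnormal sub-map whose out-degree is 0."""
    out_degree = {node: 0 for node in submap_nodes}
    for src, dst in directed_edges:
        # Count only edges whose endpoints both lie inside this sub-map.
        if src in out_degree and dst in out_degree:
            out_degree[src] += 1
    return sorted(node for node, degree in out_degree.items() if degree == 0)


nodes = {"svc", "bgp", "if1"}
edges = [("svc", "bgp"), ("bgp", "if1")]
roots = root_cause_entities(nodes, edges)
print(roots)
```

A sub-map may contain several entities with out-degree 0, in which case all of them are returned as candidates.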
Optionally, after determining the root cause fault network entities in the abnormal sub-map, for each root cause fault network entity in the abnormal sub-map, the management device determines, based on the target paths where the root cause fault network entity is located, the probability that the root cause fault network entity is a fault root cause. A target path is a path that has the root cause fault network entity as its tail node. The management device outputs the fault root cause of the target network, which includes the fault results corresponding to the n abnormal sub-maps respectively. Each fault result includes every root cause fault network entity in the abnormal sub-map and the probability that it is a fault root cause.
In this application, the management device can output not only the root cause fault network entities (that is, the fault root causes in the target network) in the knowledge graph corresponding to the target network, but also the probability that each root cause fault network entity is a fault root cause. Operation and maintenance personnel can therefore handle each fault in a targeted manner, which improves network repair efficiency.
Optionally, the implementation process in which the management device determines, based on the target paths where the root cause fault network entity is located, the probability that the root cause fault network entity is a fault root cause includes:
The management device determines the fault propagation probability corresponding to each target path where the root cause fault network entity is located. When the number of target paths where the root cause fault network entity is located is equal to 1, the management device takes the fault propagation probability corresponding to that target path as the probability that the root cause fault network entity is a fault root cause. When the number of target paths is greater than 1, the management device takes the fault propagation probability corresponding to a specified target path as the probability that the root cause fault network entity is a fault root cause, where the specified target path is the target path with the minimum fault propagation probability among all target paths where the root cause fault network entity is located.
Optionally, the implementation process of the management device determining the fault propagation probability corresponding to the target path where the root cause fault network entity is located includes:
The management device acquires all fault propagation relations between the network entities on the target path. The management device determines the fault propagation probability corresponding to a target fault propagation relation as the fault propagation probability corresponding to the target path, where the target fault propagation relation is the one with the minimum fault propagation probability among all those fault propagation relations.
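Both probability rules are minimum operations: the weakest relation determines a path's probability, and, reading "minimum failure probability" as minimum fault propagation probability, the weakest path determines the entity's probability. The sketch below assumes a path is a list of at least two entity identifiers ending at the root cause, and that `edge_prob` is a hypothetical dict keyed by consecutive pairs.

```python
def path_probability(path, edge_prob):
    """Fault propagation probability of a path: its lowest-probability relation."""
    return min(edge_prob[(a, b)] for a, b in zip(path, path[1:]))


def root_cause_probability(target_paths, edge_prob):
    """Probability that the tail entity of the target paths is a fault root cause.

    With one path this is that path's probability; with several it is the
    smallest path probability.
    """
    return min(path_probability(path, edge_prob) for path in target_paths)


edge_prob = {("svc", "bgp"): 0.9, ("bgp", "if1"): 0.7, ("vpn", "if1"): 0.6}
single = root_cause_probability([["svc", "bgp", "if1"]], edge_prob)
several = root_cause_probability([["svc", "bgp", "if1"], ["vpn", "if1"]], edge_prob)
print(single, several)
```

Taking the minimum is conservative: a root cause candidate is only rated as highly as its least certain supporting evidence.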
Optionally, the implementation process of the management device acquiring the first knowledge graph of the failed target network includes:
when the target network fails, the management device acquires an abnormal event generated in the target network. The management device identifies an abnormal network entity generating an abnormal event in the target network on an initial knowledge graph of the target network to obtain a first knowledge graph, wherein the initial knowledge graph is generated based on network data of the target network, the network data comprises networking topology of the target network and device information of a plurality of network devices in the target network, and the device information comprises one or more of interface configuration information, protocol configuration information and service configuration information.
Optionally, when the target network fails, the management device may further obtain network data of the target network. The management device extracts a plurality of knowledge-graph triplets from the network data, each knowledge-graph triplet including two network entities and a relationship between the two network entities. The management device generates an initial knowledge-graph from the plurality of knowledge-graph triplets.
Optionally, the abnormal event carries the fault occurrence time of the abnormal network entity that generated the abnormal event.
Optionally, the abnormal event includes one or more of an alarm log, a status change log, and an abnormal key performance indicator.
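Identifying abnormal network entities on the initial knowledge graph can be pictured as attaching event data to graph nodes. The following sketch is purely illustrative: the `AbnormalEvent` fields, the event kind strings, and the choice to keep the earliest fault occurrence time per entity are assumptions, not details stated in the application.

```python
from dataclasses import dataclass


@dataclass
class AbnormalEvent:
    entity_id: str     # identifier of the network entity that generated the event
    kind: str          # e.g. "alarm_log", "status_change_log", "abnormal_kpi"
    fault_time: float  # fault occurrence time carried by the event


def mark_abnormal(entity_ids, events):
    """Return {entity_id: fault_time} for entities on the graph that generated events."""
    marked = {}
    for event in events:
        if event.entity_id not in entity_ids:
            continue  # the event refers to an entity that is not on the graph
        known = marked.get(event.entity_id)
        if known is None or event.fault_time < known:
            marked[event.entity_id] = event.fault_time  # keep the earliest time
    return marked


entities = {"dev1", "if1", "bgp1"}
events = [
    AbnormalEvent("if1", "alarm_log", 101.0),
    AbnormalEvent("if1", "abnormal_kpi", 100.0),
    AbnormalEvent("bgp1", "status_change_log", 103.0),
    AbnormalEvent("unknown", "alarm_log", 99.0),
]
marked = mark_abnormal(entities, events)
print(marked)
```

The resulting mapping supplies both the set of abnormal entities and the fault occurrence times needed by the fault propagation time condition.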
In a second aspect, a fault root cause locating device is provided. The apparatus comprises a plurality of functional modules that interact to implement the method of the first aspect and its embodiments described above. The functional modules can be implemented based on software, hardware or a combination of software and hardware, and the functional modules can be combined or divided arbitrarily based on specific implementation.
In a third aspect, a fault root cause locating device is provided, which includes: a processor and a memory;
the memory for storing a computer program, the computer program comprising program instructions;
the processor is configured to invoke the computer program to implement the fault root cause location method according to any one of the first aspect.
In a fourth aspect, a computer storage medium is provided, which stores instructions that, when executed by a processor, implement the fault root cause localization method according to any one of the first aspect.
In a fifth aspect, a chip is provided, where the chip includes a programmable logic circuit and/or program instructions, and when the chip is running, the fault root cause locating method according to any one of the first aspect is implemented.
The technical solutions provided in this application bring at least the following beneficial effects:
Because the knowledge graph is generated based on the whole network, and the fault propagation conditions corresponding to the knowledge graph are also defined over the whole network, fault propagation between devices is taken into account when the knowledge graph is used to locate the fault root cause, which improves the accuracy of fault root cause locating in the network. Dividing the knowledge graph into n abnormal sub-maps, in each of which the abnormal network entities satisfy the fault propagation condition, groups the faults in the target network. The management device can then locate the fault root cause separately in each abnormal sub-map, which reduces the scale of the graph to be processed and effectively improves the efficiency of fault root cause locating.
In addition, the connection relations missing from the current knowledge graph are determined through sub-map comparison, and the connection relations with high confidence are added to the current knowledge graph. This addresses the problem that connection relations lost from the knowledge graph because of the network fault make the tracing of the fault root cause inaccurate, and further improves the accuracy of fault root cause locating.
Drawings
Fig. 1 is a schematic diagram of an application scenario involved in a fault root cause locating method provided in an embodiment of the present application;
Fig. 2 is a schematic flowchart of a fault root cause locating method provided in an embodiment of the present application;
Fig. 3 is a schematic structural diagram of an initial knowledge graph provided in an embodiment of the present application;
Fig. 4 is a schematic structural diagram of a first knowledge graph provided in an embodiment of the present application;
Fig. 5 is a flowchart of a process for completing connection relations in a knowledge graph provided in an embodiment of the present application;
Fig. 6 is a schematic structural diagram of a first sub-map provided in an embodiment of the present application;
Fig. 7 is a schematic structural diagram of m second sub-maps provided in an embodiment of the present application;
Fig. 8 is a schematic structural diagram of another first sub-map provided in an embodiment of the present application;
Fig. 9 is a schematic structural diagram of a sub-map sample provided in an embodiment of the present application;
Fig. 10 is a schematic structural diagram of another sub-map sample provided in an embodiment of the present application;
Fig. 11 is a schematic diagram of abnormal sub-map matching provided in an embodiment of the present application;
Fig. 12 is a schematic structural diagram of an abnormal sub-map provided in an embodiment of the present application;
Fig. 13 is a schematic structural diagram of a fault root cause locating device provided in an embodiment of the present application;
Fig. 14 is a schematic structural diagram of another fault root cause locating device provided in an embodiment of the present application;
Fig. 15 is a schematic structural diagram of another fault root cause locating device provided in an embodiment of the present application;
Fig. 16 is a schematic structural diagram of another fault root cause locating device provided in an embodiment of the present application;
Fig. 17 is a block diagram of a fault root cause locating device provided in an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of an application scenario involved in a fault root cause locating method provided in an embodiment of the present application. As shown in Fig. 1, the application scenario includes a management device 101 and network devices 102a-102c (collectively referred to as network devices 102) in a communication network. The numbers of management devices and network devices in Fig. 1 are merely illustrative and do not limit the application scenario involved in the fault root cause locating method provided in the embodiments of the present application. The communication network may be a data center network (DCN), a metropolitan area network, a wide area network, a campus network, a virtual local area network (VLAN), a virtual extensible local area network (VXLAN), or the like; the type of the communication network is not limited in the embodiments of the present application.
Alternatively, the management device 101 may be a server, a server cluster composed of several servers, or a cloud computing service center. The network device 102 may be a switch or router, etc. Optionally, with continued reference to fig. 1, the application scenario may further include a control device 103. The control device 103 is used to manage and control the network device 102 in the communication network. The management device 101 and the control device 103 are connected via a wired network or a wireless network, and the control device 103 and the network device 102 are connected via a wired network or a wireless network. The control device 103 may be a network controller, network management device, gateway or other device having control capabilities. The control device 103 may be one or more devices.
The control device 103 may store the networking topology of the communication network that it manages. The control device 103 is also configured to collect the device information of the network devices 102 in the communication network, the abnormal events generated in the communication network, and the like, and to provide the management device 101 with the networking topology of the communication network, the device information of the network devices 102, the abnormal events generated in the communication network, and the like. The device information of a network device includes network configuration information and/or routing table entries of the network device. The network configuration information generally includes interface configuration information, protocol configuration information, service configuration information, and the like. Optionally, the control device 103 may periodically collect the device information of the network devices 102 and the abnormal events generated in the communication network. For example, the control device may use the Simple Network Management Protocol (SNMP) or network telemetry to collect the device information of the network devices and the abnormal events generated in the communication network. Alternatively, when the device information of a network device 102 changes, the network device 102 actively reports the changed device information to the control device 103; when the communication network fails, the network device 102 actively reports the generated abnormal events to the control device 103. In some application scenarios, the management device may also be directly connected to the network devices in the communication network, that is, the application scenario may not include the control device, which is not limited in the embodiments of the present application.
Fig. 2 is a schematic flow chart of a fault root cause locating method according to an embodiment of the present application. It can be applied to the management device 101 in the application scenario as shown in fig. 1. As shown in fig. 2, the method includes:
Step 201: The management device acquires a first knowledge graph of a faulty target network, on which the abnormal network entities that generated abnormal events in the target network are identified.
The type of network entity on the first knowledge-graph is a network device, interface, protocol, or service. Optionally, the implementation process of step 201 includes:
Step 2011: When the target network fails, the management device acquires the abnormal events generated in the target network.
Optionally, the abnormal event carries the fault occurrence time of the abnormal network entity that generated the abnormal event.
A fault of the target network refers to a fault of a network device in the target network. The fault types of a network device include interface faults, protocol faults (including failure to send and receive protocol messages normally), service faults, and the like. Optionally, the abnormal event includes one or more of an alarm log, a status change log, and an abnormal key performance indicator (KPI). The alarm log includes the identifier of the abnormal network entity in the network device and the alarm type. The status change log includes configuration file change information and/or routing table entry change information; for example, the status change log may include information such as "access subinterface delete" and "destination IP host route delete". The abnormal KPI describes an abnormality in a given indicator of a given network entity.
Step 2012: The management device identifies, on the initial knowledge graph of the target network, the abnormal network entities that generated the abnormal events in the target network, to obtain the first knowledge graph.
The initial knowledge graph is generated based on network data of the target network. The network data of the target network includes the networking topology of the target network and device information of a plurality of network devices in the target network. The device information of a network device includes network configuration information of the network device, specifically one or more of interface configuration information, protocol configuration information, and service configuration information. The device information may also include routing table entries and the like. Optionally, the interface configuration information of the network device includes the Internet Protocol (IP) address of the interface, the protocol types supported by the interface, the service types supported by the interface, and the like. The protocol configuration information of the network device includes an identifier of the protocol, which uniquely identifies the protocol and can be represented by characters, letters, and/or numbers. The service configuration information of the network device includes the services used by the network device, such as Virtual Private Network (VPN) services and/or Dynamic Host Configuration Protocol (DHCP) services.
Optionally, when the target network fails, the management device may further obtain the network data of the target network, extract a plurality of knowledge graph triples from the network data, and then generate the initial knowledge graph from these triples. Each knowledge graph triple includes two network entities and the relationship between them. The relationship between two network entities may be a dependency relationship, a subordination relationship, a peer relationship, or the like. For example, the relationship between a network device and an interface is a subordination relationship, that is, the interface belongs to the network device. As another example, the relationship between two interfaces that have established a communication connection is a peer relationship.
Optionally, a network entity of the network device type in the knowledge graph may be represented by the name of the network device, a Media Access Control (MAC) address, a hardware address, an Open Shortest Path First (OSPF) router (abbreviated as OsRouter, which uniquely identifies the network device at the OSPF layer), or another identifier that uniquely identifies the network device. A network entity of the interface type may be represented by the name of the interface. A network entity of the protocol type may be represented by the identifier of the protocol. A knowledge graph triple is represented in graph form and consists of two basic elements, points and edges: a point represents a network entity, and an edge represents the relationship between two network entities, such as a dependency, subordination, or peer relationship. When two network entities are in a peer relationship, an undirected edge may be used to connect them. When a dependency or subordination relationship exists between two network entities, a directed edge (for example, an arrow) may be used to connect them; the edge points from the dependent network entity to the network entity it depends on, or from the subordinate network entity to the network entity it belongs to.
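As a minimal sketch of the triple-to-graph representation described above, the following Python snippet builds an adjacency structure from a list of triples. The entity names, the relation labels ("belongs_to", "depends_on", "peer"), and the function name are hypothetical and chosen only for illustration; the embodiment does not prescribe a storage format.

```python
from collections import defaultdict

# Hypothetical triples in the form (head, relation, tail). The relation labels
# "belongs_to", "depends_on" and "peer" are illustrative names only.
TRIPLES = [
    ("interface-1", "belongs_to", "device-A"),
    ("interface-2", "belongs_to", "device-B"),
    ("ospf-A", "depends_on", "interface-1"),
    ("interface-1", "peer", "interface-2"),
]

# Relations drawn with an undirected edge; all others get a directed edge.
UNDIRECTED_RELATIONS = {"peer"}


def build_graph(triples):
    """Map each entity to a set of (neighbor, relation, direction) tuples."""
    graph = defaultdict(set)
    for head, relation, tail in triples:
        if relation in UNDIRECTED_RELATIONS:
            graph[head].add((tail, relation, "undirected"))
            graph[tail].add((head, relation, "undirected"))
        else:
            # The edge points from the dependent or subordinate entity (head)
            # to the entity it depends on or belongs to (tail).
            graph[head].add((tail, relation, "outgoing"))
            graph[tail].add((head, relation, "incoming"))
    return graph


graph = build_graph(TRIPLES)
print(sorted(graph["interface-1"]))
```

Recording the direction on both endpoints keeps the later neighbor searches simple, because every entity can enumerate all of its connections regardless of edge direction.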
Optionally, the management device extracts, from the network data and based on an abstract service model corresponding to the network type of the target network, structured JSON data corresponding to the knowledge graph triples. The JSON data may include, for example, an OsRouter, a network segment of the OSPF layer (abbreviated as OsNetwork), physical interface information of the network device, OSPF neighbor state change information, and state value change information of the Border Gateway Protocol (BGP) state machine. The extracted JSON data is then parsed and converted into knowledge graph triples. The abstract service model reflects the relationships between different network entities, and the abstract service models of different network types may differ. An abstract service model is essentially a data object that defines the dependencies between different network entities. For example, the abstract service model may define the following: each network device has one or more interfaces, that is, the interfaces belong to the network device; an interface may carry a forwarding service, for example a Layer 3 IP forwarding service, that is, the interface supports forwarding packets by using an Interior Gateway Protocol (IGP), so the Layer 3 IP forwarding service or the IGP depends on the interface; a Virtual Extensible LAN (VXLAN) tunnel, a Traffic Engineering (TE) tunnel, and BGP may be carried over the Layer 3 IP forwarding service, that is, the VXLAN tunnel, the TE tunnel, and BGP depend on the Layer 3 IP forwarding service; a VPN service may be carried over the TE tunnel, that is, the VPN service depends on the TE tunnel; and so on.
Here, that the Layer 3 IP forwarding service can carry a VXLAN tunnel means that an interface carrying the Layer 3 IP forwarding service can serve as an endpoint of the VXLAN tunnel; that the Layer 3 IP forwarding service can carry a TE tunnel means that an interface carrying the Layer 3 IP forwarding service can serve as an endpoint of the TE tunnel; that BGP can be carried over the Layer 3 IP forwarding service means that an interface carrying the Layer 3 IP forwarding service can send and receive BGP protocol messages; and that the TE tunnel can carry a VPN service means that an interface carrying the TE tunnel can support the VPN service.
Optionally, the management device may extract the structured JSON data corresponding to the knowledge graph triples from the network configuration information of the network device, or from the routing table entries of the network device.
Optionally, the management device may periodically acquire the device information of the network devices in the target network and generate the initial knowledge graph of the target network. After generating the initial knowledge graph, the management device may also store it in the management device or in a storage device connected to the management device for subsequent use; for example, the initial knowledge graph of the target network may serve as a basis for determining the fault propagation relationships between network entities and/or as a basis for fault root cause inference. For example, when the target network fails in a certain period, the management device may identify the abnormal network entities that generate abnormal events on the initial knowledge graph corresponding to that period to obtain the knowledge graph on which the abnormal network entities are identified, which improves the efficiency of obtaining that knowledge graph.
Illustratively, assume that the target network includes two network devices, network device A and network device B. Network device A has 3 interfaces named 10GE1/0/1, 10GE1/0/2, and 10GE1/0/3. Network device B has 4 interfaces named 10GE3/0/1, 10GE3/0/2, 10GE3/0/3, and 10GE3/0/4. Network device A and network device B both support the OSPF protocol, which is an IGP. The identifier of the OSPF protocol in network device A is 10.89.46.25, and the protocol includes 3 routing IPs: 11.11.11.11, 11.11.11.12, and 11.11.11.13. The identifier of the OSPF protocol in network device B is 10.89.49.37, and the protocol includes 4 routing IPs: 11.12.11.11, 11.12.11.12, 11.12.11.13, and 11.12.11.14. Interface "10GE1/0/2" of network device A is connected to interface "10GE3/0/2" of network device B, and the two interfaces communicate by using the OSPF protocol, where the routing IP used by interface "10GE1/0/2" of network device A is 11.11.11.11 and the routing IP used by interface "10GE3/0/2" of network device B is 11.12.11.14. The initial knowledge graph shown in fig. 3 may be derived from this network data.
Further, assume that interface "10GE1/0/2" of network device A fails and routing IP "11.11.11.11" becomes unreachable, so that the target network fails. The network entity corresponding to interface "10GE1/0/2" and the network entity corresponding to routing IP "11.11.11.11" may then be identified as abnormal network entities on the initial knowledge graph shown in fig. 3. Referring to fig. 4, an abnormal network entity may be identified by connecting an abnormal event entity to it. The abnormal event entity can be distinguished from the network entities by a special shape, color, or the like. For example, referring to fig. 4, an abnormal event entity may be represented by a triangle.
Step 202, m second knowledge graphs of the target network corresponding to m moments are obtained.
The m time instants are in one-to-one correspondence with the m second knowledge graphs, that is, the target network corresponds to one second knowledge graph at each of the m time instants. The m time instants are all different from the generation time of the first knowledge graph, and m is a positive integer.
Optionally, the m time instants are located before the generation time instant of the first knowledge graph in time sequence, and the management device acquires m second knowledge graphs of the target network corresponding to the m time instants, that is, the management device acquires m second knowledge graphs corresponding to the m past time instants of the target network.
Step 203, completing the network entity connection relationships in the first knowledge graph according to the m second knowledge graphs.
Optionally, the implementation process of step 203 includes: the management device executes a connection relation completion process for each abnormal network entity in the first knowledge graph. As shown in fig. 5, the connection relation completion process includes:
Step 2031, the management device obtains a first sub-map from the first knowledge graph.
The first sub-map includes an abnormal network entity and all network entities having a connection relationship with the abnormal network entity. The connection relationship includes a direct connection relationship and/or an indirect connection relationship.
Optionally, according to a configured maximum connection order N, the management device searches the first knowledge graph for all network entities whose connection order with the abnormal network entity is less than or equal to N, to form the first sub-map. N is a positive integer and may take the value 1 or 2. When N is 1, the first sub-map includes the abnormal network entity and all network entities directly connected to it. When N is 2, the first sub-map includes the abnormal network entity, all network entities directly connected to it, and all network entities having a second-order connection relationship with it. Two network entities have an N-order connection relationship when there are (N-1) network entities between them.
For example, assuming that N is 1, fig. 6 is a schematic structural diagram of a first sub-map provided in the embodiment of the present application. As shown in fig. 6, the first sub-map includes an abnormal network entity A and a network entity B connected to the abnormal network entity A.
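The search for all network entities within the maximum connection order N can be sketched as a breadth-first search. The adjacency dictionary, the entity names, and the function name below are assumptions made for illustration; the embodiment does not specify a search algorithm.

```python
from collections import deque


def extract_sub_map(adjacency, center, max_order):
    """Return {entity: connection order} for entities within max_order of center."""
    order = {center: 0}
    queue = deque([center])
    while queue:
        node = queue.popleft()
        if order[node] == max_order:
            continue  # do not expand beyond the configured maximum order N
        for neighbor in adjacency.get(node, ()):
            if neighbor not in order:
                order[neighbor] = order[node] + 1
                queue.append(neighbor)
    return order


# Hypothetical chain A - B - C - D (edge direction is ignored for the search).
adjacency = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["C"],
}

print(extract_sub_map(adjacency, "A", 1))  # A and its directly connected entity B
print(extract_sub_map(adjacency, "A", 2))  # additionally the second-order entity C
```

With N equal to 1 the result matches the structure described for fig. 6: the abnormal network entity and its directly connected network entity.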
Step 2032, the management device obtains, according to the identifier of the abnormal network entity, one second sub-map from each of the m second knowledge graphs, to obtain m second sub-maps.
The identifier of the abnormal network entity uniquely identifies the abnormal network entity in the first knowledge graph. Optionally, the identifier of the abnormal network entity may be obtained by combining the name of the abnormal network entity, the type identifier of the abnormal network entity, and the identifier of the network device corresponding to the abnormal network entity. A second sub-map includes a target network entity and all network entities having the connection relationship with the target network entity, where the identifier of the target network entity is the same as that of the abnormal network entity.
Optionally, the management device obtains the target network entity from the second knowledge graph according to the identifier of the abnormal network entity, and then, according to the configured maximum connection order N, searches the second knowledge graph for all network entities whose connection order with the target network entity is less than or equal to N, to form the second sub-map.
Exemplarily, fig. 7 is a schematic structural diagram of m second sub-maps provided in an embodiment of the present application. As shown in fig. 7, the m second sub-maps include 3 second sub-maps L1, L2, and L3. The second sub-map L1 includes a target network entity A', and a network entity B, a network entity C, and a network entity D connected to the target network entity A'. The second sub-map L2 includes the target network entity A', and a network entity B, a network entity D, and a network entity E connected to the target network entity A'. The second sub-map L3 includes the target network entity A', and a network entity D and a network entity E connected to the target network entity A'.
Step 2033, the management device obtains target connection relationships based on the first sub-map and the m second sub-maps.
A target connection relationship satisfies the following: the target connection relationship is not included in the first sub-map, and one or more of the m second sub-maps include the target connection relationship.
Illustratively, in combination with the examples in step 2031 and step 2032, the target connection relationships include the connection between the target network entity A' and the network entity C, the connection between the target network entity A' and the network entity D, and the connection between the target network entity A' and the network entity E.
Step 2034, when the confidence of the target connection relation is greater than the confidence threshold, the management device adds the target connection relation to the first sub-map.
The confidence of the target connection relation is positively correlated with the occurrence frequency of the target connection relation in the m second sub-maps. Optionally, the number of occurrences of the target connection relation in the m second sub-maps is c, the confidence of the target connection relation is equal to c/m, and c is a positive integer.
Illustratively, assume that the confidence threshold is 0.5. Referring to the example in step 2033, the confidence of the connection between the target network entity A' and the network entity C is 1/3, the confidence of the connection between the target network entity A' and the network entity D is 1, and the confidence of the connection between the target network entity A' and the network entity E is 2/3. The management device may therefore add the connection between the target network entity A' and the network entity D and the connection between the target network entity A' and the network entity E to the first sub-map, that is, supplement the first sub-map with the connection relationship between the abnormal network entity A and the network entity D and the connection relationship between the abnormal network entity A and the network entity E, to obtain the first sub-map shown in fig. 8.
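The completion in steps 2031 to 2034 can be sketched for the case N = 1, where each sub-map reduces to the set of entities directly connected to the abnormal (or target) network entity. The function and variable names are hypothetical; the data reproduces the example of figs. 6 and 7.

```python
def complete_connections(first_neighbors, historical_neighbors, threshold):
    """Add connections missing from the first sub-map whose confidence c/m
    exceeds the threshold. Returns (completed neighbor set, confidences)."""
    m = len(historical_neighbors)
    counts = {}
    for neighbors in historical_neighbors:
        for entity in neighbors:
            # A target connection relationship must be absent from the first sub-map.
            if entity not in first_neighbors:
                counts[entity] = counts.get(entity, 0) + 1
    confidence = {entity: count / m for entity, count in counts.items()}
    added = {entity for entity, value in confidence.items() if value > threshold}
    return first_neighbors | added, confidence


first = {"B"}                      # fig. 6: A is connected only to B
history = [
    {"B", "C", "D"},               # second sub-map L1
    {"B", "D", "E"},               # second sub-map L2
    {"D", "E"},                    # second sub-map L3
]

completed, confidence = complete_connections(first, history, threshold=0.5)
print(sorted(completed))           # ['B', 'D', 'E']
print(confidence)
```

The connection to C appears in only one of the three second sub-maps, so its confidence of 1/3 stays below the threshold and it is not added.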
When the target network fails, the network data of the target network may change, and some connection relationships between network entities may change as a result. The knowledge graph generated from the network data at the time of the failure may therefore differ in its connection relationships from the knowledge graphs corresponding to other time instants, which may affect the result of tracing the root cause of the fault. In the embodiment of the present application, the connection relationships missing from the current knowledge graph are determined by comparing sub-maps, and the connection relationships with high confidence are filled into the current knowledge graph. This addresses the problem that connection relationships missing from the knowledge graph because of the network fault make the root cause tracing inaccurate, and thereby improves the accuracy of locating the fault root cause.
Step 204, generating n abnormal sub-maps based on the first knowledge graph.
Each abnormal sub-map includes one or more abnormal network entities. When an abnormal sub-map includes a plurality of abnormal network entities, a fault propagation condition is satisfied between any abnormal network entity in the abnormal sub-map and one or more other abnormal network entities in the abnormal sub-map. The n abnormal sub-maps include all abnormal network entities on the first knowledge graph, and n is a positive integer.
Optionally, the fault propagation condition includes one or more of a fault propagation relationship, a fault propagation time condition, and a fault propagation probability condition. The fault propagation relationship indicates a path along which a fault propagates in the communication network. Two network entities satisfy the fault propagation relationship when they are located on the same fault propagation path. Two abnormal network entities satisfy the fault propagation time condition when the interval between their fault occurrence times is smaller than a target duration. Two abnormal network entities satisfy the fault propagation probability condition when the fault propagation probability corresponding to the fault propagation relationship between them is larger than a target probability threshold.
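A minimal sketch of checking the fault propagation condition between two abnormal network entities follows. It assumes the variant in which all three sub-conditions are combined, and it assumes that known fault propagation relationships are stored as a table keyed by a pair of entity types with the corresponding probability as the value. The class, the table layout, and the numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AbnormalEntity:
    name: str
    entity_type: str
    fault_time: float  # fault occurrence time in seconds (hypothetical unit)


def satisfies_propagation(a, b, relations, target_duration, target_probability):
    """Check relationship, time, and probability conditions together."""
    # Fault propagation relationship: both entities lie on a known propagation path.
    probability = relations.get((a.entity_type, b.entity_type))
    if probability is None:
        probability = relations.get((b.entity_type, a.entity_type))
    if probability is None:
        return False
    # Time condition: the interval between fault times is below the target duration.
    if abs(a.fault_time - b.fault_time) >= target_duration:
        return False
    # Probability condition: the propagation probability exceeds the threshold.
    return probability > target_probability


# Hypothetical learned relationship with its propagation probability.
relations = {("OsNetwork", "BGP Peer"): 0.67}

os_network = AbnormalEntity("net-1", "OsNetwork", fault_time=0.0)
bgp_peer = AbnormalEntity("peer-1", "BGP Peer", fault_time=39.0)
late_peer = AbnormalEntity("peer-2", "BGP Peer", fault_time=100.0)

print(satisfies_propagation(os_network, bgp_peer, relations, 60.0, 0.5))   # True
print(satisfies_propagation(os_network, late_peer, relations, 60.0, 0.5))  # False
```

An implementation that uses only one or two of the sub-conditions would simply skip the corresponding checks.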
In this embodiment of the present application, the process by which the management device acquires the fault propagation relationships may include the following steps A1 to A3:
In step A1, the management device obtains a knowledge graph sample, on which all abnormal network entities that generated abnormal events when the network to which the knowledge graph sample belongs failed are identified.
Optionally, the network to which the knowledge-graph sample belongs is a target network, or the network to which the knowledge-graph sample belongs is another network of the same type as the network of the target network.
In step A2, the management device selects a plurality of abnormal network entities in the knowledge graph sample as central nodes, and determines one or more sub-map samples based on each central node, where each sub-map sample includes the central node and one nearest neighbor abnormal network entity of the central node.
No other abnormal network entity exists between the central node and its nearest neighbor abnormal network entity. Optionally, the nearest neighbor abnormal network entity may be directly or indirectly connected to the central node. The central node is directly connected to the nearest neighbor abnormal network entity when there is no other network entity between them. The central node is indirectly connected to the nearest neighbor abnormal network entity when one or more normal network entities exist between them.
Optionally, in a sub-map sample determined by the management device, the connection order between the central node and its nearest neighbor abnormal network entity is less than or equal to q, where q is a positive integer. Optionally, q may take the value 2.
Exemplarily, fig. 9 and fig. 10 are each a schematic structural diagram of a sub-map sample provided in an embodiment of the present application. As shown in fig. 9, both abnormal network entities in the sub-map sample are OsNetwork entities, and the connection order between the two abnormal network entities is equal to 1. As shown in fig. 10, the two abnormal network entities in the sub-map sample are a BGP Peer and an OsNetwork, and they are connected through an L3link, that is, the connection order between the two abnormal network entities is equal to 2.
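The search for the nearest neighbor abnormal network entities of a central node can be sketched as a breadth-first search that stops at the first abnormal entity on each path and is bounded by the order q. The adjacency data below combines the shapes of figs. 9 and 10 into one hypothetical sample and treats the L3link entity as normal; all names are illustrative.

```python
from collections import deque


def nearest_abnormal_neighbors(adjacency, abnormal, center, max_order):
    """Return {abnormal entity: connection order} for the nearest neighbor
    abnormal entities of center, reached only through normal entities."""
    found = {}
    order = {center: 0}
    queue = deque([center])
    while queue:
        node = queue.popleft()
        if order[node] == max_order:
            continue
        for neighbor in adjacency.get(node, ()):
            if neighbor in order:
                continue
            order[neighbor] = order[node] + 1
            if neighbor in abnormal:
                # Stop here: nothing behind an abnormal entity is a nearest neighbor.
                found[neighbor] = order[neighbor]
            else:
                queue.append(neighbor)
    return found


adjacency = {
    "BGP Peer": ["L3link"],
    "L3link": ["BGP Peer", "OsNetwork-1"],
    "OsNetwork-1": ["L3link", "OsNetwork-2"],
    "OsNetwork-2": ["OsNetwork-1"],
}
abnormal = {"BGP Peer", "OsNetwork-1", "OsNetwork-2"}

print(nearest_abnormal_neighbors(adjacency, abnormal, "BGP Peer", 2))
print(nearest_abnormal_neighbors(adjacency, abnormal, "OsNetwork-1", 2))
```

Each (central node, nearest neighbor) pair found this way, together with the normal entities on the path between them, would correspond to one sub-map sample.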
In step A3, the management device determines the fault propagation relationships based on the plurality of sub-map samples.
In some embodiments, the management device may convert each of the plurality of sub-map samples into a graph embedding vector according to a graph embedding algorithm, obtaining a plurality of graph embedding vectors in one-to-one correspondence with the sub-map samples. The management device then determines a plurality of sub-graph sets according to the graph embedding vectors and a clustering algorithm, where each sub-graph set includes at least one of the sub-map samples. Finally, the management device extracts fault propagation relationships from the sub-map samples included in each sub-graph set according to a frequent subgraph mining algorithm.
As an example, the management device may determine the plurality of sub-graph sets as follows: determine the similarity between every two of the graph embedding vectors, and then cluster the sub-map samples according to the determined similarities and the clustering algorithm to obtain the plurality of sub-graph sets.
Because a graph embedding vector represents its sub-map sample, the management device can cluster the sub-map samples with the clustering algorithm, based on the similarity between every two graph embedding vectors, to obtain the plurality of sub-graph sets.
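The similarity-based grouping can be illustrated with a small sketch. It assumes cosine similarity as the similarity measure and uses a greedy threshold grouping purely as a simple stand-in for a real clustering algorithm; the embedding vectors are made-up values, not the output of an actual graph embedding algorithm.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def group_by_similarity(vectors, threshold):
    """Greedy stand-in for clustering: put each vector into the first group
    whose first member is at least `threshold` similar, else open a new group.
    Returns a list of groups, each a list of sample indices."""
    groups = []
    for index, vector in enumerate(vectors):
        for group in groups:
            representative = vectors[group[0]]
            if cosine_similarity(vector, representative) >= threshold:
                group.append(index)
                break
        else:
            groups.append([index])
    return groups


# Hypothetical graph embedding vectors, one per sub-map sample.
embeddings = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
]

print(group_by_similarity(embeddings, threshold=0.9))  # [[0, 1], [2, 3]]
```

Each resulting group plays the role of one sub-graph set from which fault propagation relationships would then be mined.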
In other embodiments, the management device may extract the fault propagation relationships directly from the plurality of sub-map samples according to a frequent subgraph mining algorithm. That is, the management device does not need to perform graph embedding vector conversion or sub-map sample clustering. Of course, the embodiment of the present application takes a frequent subgraph mining algorithm only as an example; the management device may also extract the fault propagation relationships from the plurality of sub-map samples according to other algorithms, which are not listed here.
It should be noted that the number of fault propagation relationships extracted by the management device according to the frequent subgraph mining algorithm may be 0, 1, or greater than 1. Moreover, no fault propagation relationship may be extracted from some sub-map samples, one or more fault propagation relationships may be extracted from other sub-map samples, and the same fault propagation relationship may be extracted from two or more sub-map samples.
Optionally, the graph embedding algorithm may be an algorithm such as graph2vec or a graph neural network (GNN), the clustering algorithm may be an algorithm such as K-means or affinity propagation (AP), and the frequent subgraph mining algorithm may be an algorithm such as gSpan or CloseGraph, which is not limited in the embodiment of the present application.
The fault propagation relationship may be expressed in text form or in graphic form. For example, the text-form fault propagation relationship "OsNetwork-L3link-BGP Peer" indicates that a neighbor protocol state fault in the OsNetwork makes the IP address of the BGP loopback interface unreachable (L3link), which finally causes the BGP neighbor (BGP Peer) connection to break.
In this embodiment of the present application, the fault propagation relationships determined by the management device may include 1-order fault propagation relationships (for example, the fault propagation relationship shown in fig. 9) and 2-order fault propagation relationships (for example, the fault propagation relationship shown in fig. 10), and may also include higher-order fault propagation relationships such as 3-order and 4-order relationships, which is not limited in this embodiment of the present application.
After determining the fault propagation relationships from the plurality of sub-map samples, the management device may determine the fault propagation probability and/or the fault propagation time corresponding to each extracted fault propagation relationship. That is, the management device may determine only the fault propagation probability, only the fault propagation time, or both.
In some embodiments, the management device determines the fault propagation time corresponding to an extracted fault propagation relationship as follows. The management device obtains the fault occurrence time of the start point and the fault occurrence time of the end point of a first fault propagation relationship, where the first fault propagation relationship is a fault propagation relationship extracted from a first sub-graph set, and the plurality of sub-graph sets include the first sub-graph set. The management device determines the difference between the fault occurrence time of the start point and the fault occurrence time of the end point of the first fault propagation relationship as the fault propagation time corresponding to the first fault propagation relationship.
Optionally, the management device determines the fault occurrence time of the start point and the fault occurrence time of the end point of the first fault propagation relationship as follows. The management device determines, from the first sub-graph set, the sub-map samples in which the first fault propagation relationship occurs, and obtains from each determined sub-map sample the fault occurrence time carried by the abnormal event corresponding to the start point of the first fault propagation relationship and the fault occurrence time carried by the abnormal event corresponding to the end point. The average of the fault occurrence times carried by the abnormal events corresponding to the start points is determined as the fault occurrence time of the start point of the first fault propagation relationship, and the average of the fault occurrence times carried by the abnormal events corresponding to the end points is determined as the fault occurrence time of the end point of the first fault propagation relationship.
Of course, the management device may instead determine, from the first sub-graph set, the sub-map samples in which the first fault propagation relationship occurs, obtain from each determined sub-map sample the fault occurrence time carried by the abnormal event corresponding to the start point and the fault occurrence time carried by the abnormal event corresponding to the end point, compute for each sub-map sample the difference between these two fault occurrence times, and determine the average of the computed differences as the fault propagation time corresponding to the first fault propagation relationship.
Because the first sub-graph set is one sub-graph set in the multiple sub-graph sets, and the first fault propagation relation is one fault propagation relation extracted from the first sub-graph set, the fault propagation time corresponding to each fault propagation relation extracted from each sub-graph set can be determined according to the method.
For example, the management device extracts 3 fault propagation relationships: fault propagation relationship 1, fault propagation relationship 2, and fault propagation relationship 3. If the fault occurrence time of the start point of fault propagation relationship 1 is 10:20:21 and the fault occurrence time of the end point is 10:21:00, the fault propagation time corresponding to fault propagation relationship 1 is 39 seconds. Similarly, if the fault occurrence time of the start point of fault propagation relationship 2 is 10:23:02 and the fault occurrence time of the end point is 10:24:20, the fault propagation time corresponding to fault propagation relationship 2 is 1 minute 18 seconds. If the fault occurrence time of the start point of fault propagation relationship 3 is 10:22:10 and the fault occurrence time of the end point is 10:22:59, the fault propagation time corresponding to fault propagation relationship 3 is 49 seconds.
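The time arithmetic in this example, and the variant that averages the per-sample differences, can be sketched as follows. The timestamp format "HH:MM:SS", the function names, and the second sample pair used for averaging are assumptions made for illustration.

```python
from datetime import datetime

TIME_FORMAT = "%H:%M:%S"  # assumed timestamp format of the fault occurrence time


def propagation_seconds(start, end):
    """Seconds between the start-point and end-point fault occurrence times."""
    delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    return delta.total_seconds()


def mean_propagation_seconds(samples):
    """Average of the per-sample (start, end) differences for one relationship."""
    differences = [propagation_seconds(start, end) for start, end in samples]
    return sum(differences) / len(differences)


print(propagation_seconds("10:20:21", "10:21:00"))  # 39.0  (relationship 1)
print(propagation_seconds("10:23:02", "10:24:20"))  # 78.0  (relationship 2, 1 min 18 s)
print(propagation_seconds("10:22:10", "10:22:59"))  # 49.0  (relationship 3)

# Hypothetical: the same relationship observed in two sub-map samples.
observations = [("10:20:21", "10:21:00"), ("10:30:00", "10:30:41")]
print(mean_propagation_seconds(observations))       # 40.0
```

Averaging the differences per sample, rather than averaging start and end times separately, gives the same result whenever every sample contributes both a start and an end time.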
In some embodiments, the management device determines the probability of occurrence of a fault propagation relationship as follows. The management device determines the number of sub-map samples in the first sub-graph set in which the first fault propagation relationship occurs, where the first fault propagation relationship is a fault propagation relationship extracted from the first sub-graph set, and the plurality of sub-graph sets include the first sub-graph set. The management device determines the probability of occurrence of the first fault propagation relationship according to the ratio of the determined number to the total number of sub-map samples in the first sub-graph set.
Because the first sub-graph set is one of the plurality of sub-graph sets, and the first fault propagation relationship is one fault propagation relationship extracted from the first sub-graph set, the probability of occurrence of each fault propagation relationship extracted from each sub-graph set can be determined according to this method.
As an example, the management device may directly determine the ratio of the determined number to the total number of sub-map samples in the first sub-graph set as the probability of occurrence of the first fault propagation relationship.
For example, the management device extracts fault propagation relationship 1 from the first sub-graph set. If the number of sub-map samples in the first sub-graph set in which fault propagation relationship 1 occurs is 20 and the total number of sub-map samples in the first sub-graph set is 30, the probability of occurrence of fault propagation relationship 1 is 20/30, that is, approximately 67%.
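The ratio in this example can be computed as in the following sketch, which assumes that each sub-map sample is summarized by the set of fault propagation relationships that occur in it. This representation and the labels are hypothetical.

```python
def propagation_probability(sample_relationships, relationship):
    """Share of sub-map samples in one sub-graph set that contain the relationship."""
    if not sample_relationships:
        return 0.0
    hits = sum(1 for found in sample_relationships if relationship in found)
    return hits / len(sample_relationships)


# Hypothetical first sub-graph set: 30 samples, 20 of which contain relationship 1.
first_sub_graph_set = [{"relationship-1"}] * 20 + [set()] * 10

probability = propagation_probability(first_sub_graph_set, "relationship-1")
print(f"{probability:.0%}")  # 67%
```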
Optionally, the management device may group the abnormal network entities on the first knowledge graph based on a graph matching algorithm to obtain n abnormal sub-graphs, and the specific implementation process includes: the management device obtains an abnormal network entity set, wherein the abnormal network entity set comprises all abnormal network entities on the first knowledge graph. And the management equipment repeatedly executes the sub-graph spectrum generation process until the abnormal network entity set is an empty set, so as to obtain n abnormal sub-graph spectrums. The sub-graph spectrum generation flow comprises steps B1-B3:
In step B1, the management device selects an initial abnormal network entity from the abnormal network entity set.
Optionally, the initial abnormal network entity may be any abnormal network entity in the abnormal network entity set. After selecting the initial abnormal network entity, the management device uses the initial abnormal network entity as an abnormal sub-map.
In step B2, the management device performs a target matching procedure on the initial abnormal network entity to obtain an abnormal sub-map including the initial abnormal network entity.
The target matching process comprises the following steps:
The management device acquires, based on the first knowledge graph, all target nearest-neighbor abnormal network entities of the initial abnormal network entity, where no other abnormal network entity exists between a target nearest-neighbor abnormal network entity and the initial abnormal network entity, and a target nearest-neighbor abnormal network entity is not located in the abnormal sub-map in which the initial abnormal network entity is located. For each target nearest-neighbor abnormal network entity of the initial abnormal network entity: when the fault propagation condition is satisfied between the target nearest-neighbor abnormal network entity and the initial abnormal network entity, the management device adds the target nearest-neighbor abnormal network entity to the abnormal sub-map in which the initial abnormal network entity is located, takes the target nearest-neighbor abnormal network entity as a new initial abnormal network entity, and executes the target matching process again; when the fault propagation condition is not satisfied between the target nearest-neighbor abnormal network entity and the initial abnormal network entity, the management device determines that the target nearest-neighbor abnormal network entity does not belong to the abnormal sub-map in which the initial abnormal network entity is located.
Optionally, when the fault propagation condition is not satisfied between all target nearest neighbor abnormal network entities of the initial abnormal network entity and the initial abnormal network entity, the management device ends the target matching process executed based on the initial abnormal network entity. In the process of executing the target matching process, the management device may delete an abnormal network entity from the abnormal network entity set every time the abnormal network entity is added to the abnormal sub-map where the initial abnormal network entity is located.
In one possible case, when the target nearest-neighbor abnormal network entity and the initial abnormal network entity are two adjacent network entities, satisfying the fault propagation condition between the target nearest-neighbor abnormal network entity and the initial abnormal network entity includes:
the target nearest neighbor abnormal network entity and the initial abnormal network entity have a fault propagation relation, the time interval between the fault occurrence time of the target nearest neighbor abnormal network entity and the fault occurrence time of the initial abnormal network entity is less than the target duration, and the fault propagation probability corresponding to the fault propagation relation is greater than the target probability threshold.
In another possible case, when there is a normal network entity between the target nearest-neighbor abnormal network entity and the initial abnormal network entity, satisfying the fault propagation condition between the target nearest-neighbor abnormal network entity and the initial abnormal network entity includes:
the target nearest neighbor abnormal network entity and the normal network entity have a first fault propagation relation, the initial abnormal network entity and the normal network entity have a second fault propagation relation, the time interval between the fault occurrence time of the target nearest neighbor abnormal network entity and the fault occurrence time of the initial abnormal network entity is less than the target duration, and the fault propagation probability corresponding to the first fault propagation relation and the fault propagation probability corresponding to the second fault propagation relation are both greater than the target probability threshold.
Exemplarily, fig. 11 is a schematic diagram of matching of an abnormal sub-map provided in an embodiment of the present application. Abnormal network entity A is taken as the initial abnormal network entity. Abnormal network entity A has 6 nearest-neighbor abnormal network entities B to G, and the dotted arrows in the figure represent fault propagation relations. As shown in fig. 11, abnormal network entity A satisfies a first-order fault propagation relation with abnormal network entity B and with abnormal network entity C, and satisfies a second-order fault propagation relation with abnormal network entity E and with abnormal network entity F. When the fault propagation probability corresponding to the fault propagation relation between A and B is greater than the target probability threshold, and the time interval between the fault occurrence time of A and the fault occurrence time of B is less than the target duration, it can be determined that the fault propagation condition is satisfied between B and A, and therefore that B belongs to the abnormal sub-map in which A is located. Abnormal network entity B can then be used as a new initial abnormal network entity, and other abnormal network entities that satisfy the fault propagation condition with B are matched in the knowledge graph. For the process of determining whether abnormal network entities C, E, and F belong to the abnormal sub-map in which A is located, refer to the determination process for abnormal network entity B; details are not repeated here.
In addition, the abnormal network entity A and the normal network entity X also satisfy a first-order fault propagation relation, and the normal network entity X and the abnormal network entity G satisfy a second-order fault propagation relation. When the fault propagation probability corresponding to the fault propagation relation between A and X is greater than the target probability threshold, the fault propagation probability corresponding to the fault propagation relation between X and G is greater than the target probability threshold, and the time interval between the fault occurrence time of A and the fault occurrence time of G is less than the target duration, it can be determined that the fault propagation condition is satisfied between G and A, and therefore that G belongs to the abnormal sub-map in which A is located. Abnormal network entity G can then be used as a new initial abnormal network entity, and other abnormal network entities that satisfy the fault propagation condition with G are matched in the knowledge graph.
In step B3, the management device deletes all the abnormal network entities in the abnormal sub-map from the abnormal network entity set, resulting in an updated abnormal network entity set.
When the updated abnormal network entity set is not empty, the management device executes the sub-map generation process again; when the updated abnormal network entity set is empty, the management device has finished generating the abnormal sub-maps.
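Steps B1 to B3 can be sketched in Python as follows. This is only an illustrative sketch under several assumptions that are not part of the application: the knowledge graph is given as an undirected adjacency dictionary, the fault propagation condition is supplied as a hypothetical callback `satisfies(current, candidate)`, nearest-neighbor abnormal entities are found by walking through normal entities only, and the recursive target matching process is written iteratively with a stack.

```python
from collections import deque
from typing import Callable, Dict, List, Set

Graph = Dict[str, Set[str]]  # assumption: undirected adjacency of the knowledge graph


def nearest_abnormal_neighbors(graph: Graph, abnormal: Set[str], start: str) -> Set[str]:
    """Abnormal entities reachable from `start` without crossing another abnormal entity."""
    found: Set[str] = set()
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, set()):
            if nxt in seen:
                continue
            seen.add(nxt)
            if nxt in abnormal:
                found.add(nxt)      # stop here: no abnormal entity may lie in between
            else:
                queue.append(nxt)   # keep walking through normal entities
    return found


def group_abnormal_entities(graph: Graph, abnormal: Set[str],
                            satisfies: Callable[[str, str], bool]) -> List[Set[str]]:
    remaining = set(abnormal)
    sub_maps: List[Set[str]] = []
    while remaining:
        start = min(remaining)   # B1: any entity may be chosen; min() keeps runs deterministic
        sub_map = {start}
        stack = [start]
        while stack:             # B2: target matching process
            current = stack.pop()
            for cand in nearest_abnormal_neighbors(graph, abnormal, current):
                # Only entities still in the set are eligible, so each entity
                # ends up in exactly one abnormal sub-map.
                if cand in remaining and cand not in sub_map and satisfies(current, cand):
                    sub_map.add(cand)
                    stack.append(cand)
        remaining -= sub_map     # B3: remove the grouped entities from the set
        sub_maps.append(sub_map)
    return sub_maps


# Hypothetical topology: A-B adjacent, A-X-G via normal entity X, C isolated from A.
graph = {"A": {"B", "X"}, "B": {"A"}, "X": {"A", "G"}, "G": {"X"}, "C": {"Y"}, "Y": {"C"}}
abnormal = {"A", "B", "G", "C"}
allowed = {frozenset(("A", "B")), frozenset(("A", "G"))}
sub_maps = group_abnormal_entities(graph, abnormal, lambda a, b: frozenset((a, b)) in allowed)
```

In this hypothetical topology the sketch yields two abnormal sub-maps, one containing A, B, and G and one containing only C.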
In the embodiment of the application, the knowledge graph is divided into n abnormal sub-graphs, so that the fault propagation conditions are met between the abnormal network entities in each abnormal sub-graph, the faults in the target network are grouped, the management equipment can subsequently perform fault root cause positioning respectively based on each abnormal sub-graph, the scale of the knowledge graph is reduced, and the fault root cause positioning efficiency can be effectively improved.
Step 205, for one or more abnormal sub-maps in the n abnormal sub-maps, determining a root cause failure network entity in the abnormal sub-map.
Optionally, the management device may determine the root cause failure network entity in each of the n abnormal sub-maps separately. The implementation process of the management device determining the root cause failure network entity in an abnormal sub-map includes: the management device calculates the out-degree of each abnormal network entity in the abnormal sub-map, and determines an abnormal network entity whose out-degree is 0 as a root cause failure network entity in the abnormal sub-map. The abnormal sub-map is usually a directed graph, and the out-degree of an abnormal network entity is equal to the number of edges that have the abnormal network entity as the tail.
Exemplarily, fig. 12 is a schematic structural diagram of an abnormal sub-map provided in an embodiment of the present application. As shown in fig. 12, the abnormal sub-map includes an abnormal network entity A, an abnormal network entity F, and an abnormal network entity H. The out-degree of abnormal network entity A is equal to 2, the out-degree of abnormal network entity F is equal to 1, and the out-degree of abnormal network entity H is equal to 0, so the management device can determine that the root cause failure network entity in the abnormal sub-map is abnormal network entity H.
Notably, each abnormal sub-map may include one or more root cause failure network entities.
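The out-degree rule can be illustrated with a short Python sketch. The edge list below is an assumed reading of fig. 12 derived from the target paths A → X1 → F → H and A → X2 → F → H; the function names are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[str, str]  # (tail, head): the edge points from tail to head


def out_degrees(edges: Iterable[Edge], abnormal: Set[str]) -> Dict[str, int]:
    """Out-degree of every abnormal entity (number of edges with it as the tail)."""
    degrees = {entity: 0 for entity in abnormal}
    for tail, _head in edges:
        if tail in degrees:
            degrees[tail] += 1
    return degrees


def root_cause_entities(edges: Iterable[Edge], abnormal: Set[str]) -> Set[str]:
    """Abnormal entities whose out-degree is 0."""
    return {entity for entity, degree in out_degrees(edges, abnormal).items() if degree == 0}


# Assumed edges for the sub-map of fig. 12; X1 and X2 are normal entities.
edges = [("A", "X1"), ("A", "X2"), ("X1", "F"), ("X2", "F"), ("F", "H")]
abnormal = {"A", "F", "H"}
degrees = out_degrees(edges, abnormal)
roots = root_cause_entities(edges, abnormal)
```

For these edges the out-degrees of A, F, and H are 2, 1, and 0, so H is the only root cause failure network entity, matching the example in the text.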
Step 206, determining, for each root cause failure network entity in the abnormal sub-map, the probability that the root cause failure network entity is the fault root cause, based on the target path on which the root cause failure network entity is located.
The target path is a path with the root cause failure network entity as the tail node. For example, in the abnormal sub-map shown in fig. 12, the root cause failure network entity H has 2 corresponding target paths, which are: A → X1 → F → H, and A → X2 → F → H.
Optionally, the management device may perform path retrieval based on a depth-first search (DFS) algorithm to obtain all paths in the abnormal sub-map that take the root cause failure network entity as the tail node. The implementation process of the management device determining, based on the target path on which the root cause failure network entity is located, the probability that the root cause failure network entity is the fault root cause includes:
In step 2061, the management device determines the fault propagation probability corresponding to the target path on which the root cause failure network entity is located.
Optionally, the management device obtains all fault propagation relations between network entities on the target path, and determines the fault propagation probability corresponding to a target fault propagation relation as the fault propagation probability corresponding to the target path. The target fault propagation relation is the fault propagation relation with the smallest corresponding fault propagation probability among all the fault propagation relations.
Illustratively, in the abnormal sub-map shown in fig. 12, each of the 2 target paths includes a second-order fault propagation relation and a first-order fault propagation relation. For the target path A → X1 → F → H, assuming that the fault propagation probability corresponding to the second-order fault propagation relation between abnormal network entity A and abnormal network entity F is P1, and the fault propagation probability corresponding to the first-order fault propagation relation between abnormal network entity F and abnormal network entity H is P2, the fault propagation probability corresponding to the target path is the smaller of P1 and P2. For the target path A → X2 → F → H, assuming that the fault propagation probability corresponding to the second-order fault propagation relation between abnormal network entity A and abnormal network entity F is P3, and the fault propagation probability corresponding to the first-order fault propagation relation between abnormal network entity F and abnormal network entity H is P2, the fault propagation probability corresponding to the target path is the smaller of P3 and P2.
In step 2062, when the number of target paths on which the root cause failure network entity is located is equal to 1, the management device takes the fault propagation probability corresponding to the target path as the probability that the root cause failure network entity is the fault root cause.
In step 2063, when the number of target paths on which the root cause failure network entity is located is greater than 1, the management device takes the fault propagation probability corresponding to a specified target path as the probability that the root cause failure network entity is the fault root cause, where the specified target path is the target path with the smallest fault propagation probability among all the target paths on which the root cause failure network entity is located.
Illustratively, referring to the example in step 2061, assuming that P1 < P2 < P3, the fault propagation probability of the target path A → X1 → F → H is P1, the fault propagation probability of the target path A → X2 → F → H is P2, and the probability that the root cause failure network entity H is the fault root cause is P1.
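Steps 2061 to 2063 amount to taking a minimum over the relations of each target path and then a minimum over the target paths. The Python sketch below illustrates this with a small depth-first search. The hop representation (one hop per fault propagation relation between two abnormal entities, labelled with the intermediate normal entity if any), the choice of starting the search at entity A, and the numeric values standing in for P1, P2, and P3 are all assumptions for illustration.

```python
from typing import Dict, List, Tuple

# Assumption: one hop per fault propagation relation between abnormal entities,
# stored as (next entity, label of the relation, fault propagation probability).
Hop = Tuple[str, str, float]


def target_paths(hops: Dict[str, List[Hop]], start: str, root: str) -> List[List[float]]:
    """Depth-first search returning, per path from start to root, its relation probabilities."""
    results: List[List[float]] = []

    def dfs(node: str, probs: List[float], visited: set) -> None:
        if node == root:
            results.append(list(probs))
            return
        for nxt, _label, prob in hops.get(node, []):
            if nxt in visited:
                continue
            visited.add(nxt)
            probs.append(prob)
            dfs(nxt, probs, visited)
            probs.pop()
            visited.remove(nxt)

    dfs(start, [], {start})
    return results


def path_probability(probs: List[float]) -> float:
    """Step 2061: the smallest relation probability on the path."""
    if not probs:
        raise ValueError("a target path needs at least one fault propagation relation")
    return min(probs)


def root_cause_probability(paths: List[List[float]]) -> float:
    """Steps 2062 and 2063: the smallest path probability (one path is a special case)."""
    return min(path_probability(p) for p in paths)


# Illustrative values only, chosen so that P1 < P2 < P3.
P1, P2, P3 = 0.6, 0.7, 0.9
hops = {"A": [("F", "via X1", P1), ("F", "via X2", P3)], "F": [("H", "direct", P2)]}
paths = target_paths(hops, "A", "H")
path_probs = [path_probability(p) for p in paths]
root_prob = root_cause_probability(paths)
```

With these values the two target paths get probabilities P1 and P2, and the probability assigned to H is P1, as in the example above. The case of a root cause failure network entity with no target path is not covered by the text and is not handled here.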
Step 207, outputting the fault root cause of the target network.
The fault root cause of the target network includes the fault results respectively corresponding to the n abnormal sub-maps, and each fault result includes each root cause failure network entity in the corresponding abnormal sub-map and the probability that each root cause failure network entity is the fault root cause.
Optionally, the fault root cause of the target network output by the management device may be expressed as: [{fault group A, root cause failure network entity set, probability set corresponding to the root cause failure network entity set}, {fault group B, root cause failure network entity set, probability set corresponding to the root cause failure network entity set}, …]. Each fault group may include all abnormal network entities in one abnormal sub-map. Illustratively, referring to the example in step 206, the fault result corresponding to the abnormal sub-map shown in fig. 12 may be represented as: {{A, F, H}, {H}, {P1}}.
Optionally, the management device outputs a fault result corresponding to the target network to an operation and maintenance support system (OSS) or other terminal devices connected to the management device, and the fault result is provided for the OSS or the terminal devices to display. Of course, if the management device has a display function, the management device may also directly display the knowledge graph of the target network on its own display interface.
In the embodiment of the application, the management device outputs the fault root cause of the target network, so that operation and maintenance personnel can conveniently check the root cause failure network entities in the target network. This enables fast fault root cause locating and improves fault repair efficiency, which in turn shortens the time a network device takes to return from the fault state to the working state, also called the mean time to recovery (MTTR).
Optionally, in this embodiment of the present application, the management device may include one device or multiple devices. When the management device includes one device, all steps of the fault root cause positioning method are executed by that device. When the management device includes multiple devices, for example, a first device, a second device, and a third device, the first device may generate an initial knowledge graph of the target network and identify the abnormal network entities on the initial knowledge graph of the target network. The second device may obtain, through training based on knowledge graph samples, a fault propagation condition set corresponding to the target network. The first device sends the knowledge graph on which the abnormal network entities are identified to the third device, and the second device sends the fault propagation condition set to the third device. The third device performs steps 201 to 207.
The step sequence of the fault root cause positioning method provided by the embodiment of the application may be appropriately adjusted, and the steps may be increased or decreased according to the situation, for example, step 202 and step 203 may not be executed. Any method that can be easily conceived by a person skilled in the art within the technical scope disclosed in the present application is covered by the protection scope of the present application, and thus the detailed description thereof is omitted.
In summary, in the fault root cause positioning method provided in the embodiment of the present application, since the knowledge graph is generated based on the entire network, and the fault propagation condition corresponding to the knowledge graph is also based on the entire network, fault propagation between devices can be considered when fault root cause positioning is performed in the network by using the knowledge graph, so that the fault root cause positioning accuracy in the network is improved. By dividing the knowledge graph into n abnormal sub-graphs, fault propagation conditions are met between abnormal network entities in each abnormal sub-graph, fault grouping in a target network is achieved, fault root cause positioning can be respectively carried out on the basis of each abnormal sub-graph in the follow-up process of the management equipment, the scale of the knowledge graph is reduced, and the fault root cause positioning efficiency can be effectively improved.
In addition, in the embodiment of the application, connection relations missing from the current knowledge graph are determined through sub-map comparison, and connection relations with high confidence are filled into the current knowledge graph. This addresses the problem that connection relations lost from the knowledge graph because of network faults make the tracing of the fault root cause inaccurate, and further improves the accuracy of fault root cause positioning.
Fig. 13 is a schematic structural diagram of a fault root cause locating device according to an embodiment of the present application. It can be applied to the management device 101 in the application scenario as shown in fig. 1. As shown in fig. 13, the apparatus 130 includes:
the first obtaining module 1301 is configured to obtain a first knowledge graph of a target network that fails, where an abnormal network entity that generates an abnormal event in the target network is identified on the first knowledge graph, and a type of the network entity on the first knowledge graph is a network device, an interface, a protocol, or a service.
The first generating module 1302 is configured to generate n abnormal sub-maps based on the first knowledge-map, each abnormal sub-map includes one or more abnormal network entities, when the abnormal sub-map includes multiple abnormal network entities, a fault propagation condition is satisfied between any abnormal network entity in the abnormal sub-map and one or more other abnormal network entities in the abnormal sub-map, the n abnormal sub-maps include all abnormal network entities on the first knowledge-map, and any abnormal network entity belongs to only one abnormal sub-map, and n is a positive integer.
A first determining module 1303, configured to determine, for one or more of the n abnormal sub-maps, a root cause failure network entity in the abnormal sub-map.
In summary, according to the fault root cause positioning device provided by the embodiment of the present application, since the knowledge graph is generated based on the entire network, and the fault propagation condition corresponding to the knowledge graph is also based on the entire network, fault propagation between devices can be considered when fault root cause positioning is performed in the network by using the knowledge graph, and the fault root cause positioning accuracy in the network is improved. By dividing the knowledge graph into n abnormal sub-graphs, fault propagation conditions are met between abnormal network entities in each abnormal sub-graph, fault grouping in a target network is achieved, fault root cause positioning can be respectively carried out on the basis of each abnormal sub-graph in the follow-up process of the management equipment, the scale of the knowledge graph is reduced, and the fault root cause positioning efficiency can be effectively improved.
Optionally, the fault propagation conditions comprise one or more of a fault propagation relationship, a fault propagation time condition and a fault propagation probability condition.
Optionally, the first generating module is configured to: acquire an abnormal network entity set, where the abnormal network entity set includes all abnormal network entities on the first knowledge graph; and repeatedly execute a sub-map generation process until the abnormal network entity set is empty, to obtain the n abnormal sub-maps, where the sub-map generation process includes:
selecting an initial abnormal network entity from the abnormal network entity set; executing a target matching process on the initial abnormal network entity to obtain an abnormal sub-map including the initial abnormal network entity; and deleting all abnormal network entities in the abnormal sub-map from the abnormal network entity set to obtain an updated abnormal network entity set; the target matching process includes:
based on the first knowledge graph, acquiring all target nearest-neighbor abnormal network entities of the initial abnormal network entity, where no other abnormal network entity exists between a target nearest-neighbor abnormal network entity and the initial abnormal network entity, and a target nearest-neighbor abnormal network entity is not located in the abnormal sub-map in which the initial abnormal network entity is located; and for each target nearest-neighbor abnormal network entity:
when the fault propagation condition is satisfied between the target nearest neighbor abnormal network entity and the initial abnormal network entity, adding the target nearest neighbor abnormal network entity into an abnormal sub-map where the initial abnormal network entity is located, taking the target nearest neighbor abnormal network entity as a new initial abnormal network entity, executing the target matching process again, and when the fault propagation condition is not satisfied between the target nearest neighbor abnormal network entity and the initial abnormal network entity, determining that the target nearest neighbor abnormal network entity does not belong to the abnormal sub-map where the initial abnormal network entity is located.
Optionally, when the target nearest-neighbor abnormal network entity and the initial abnormal network entity are two adjacent network entities, satisfying the fault propagation condition between the target nearest-neighbor abnormal network entity and the initial abnormal network entity includes:
the target nearest neighbor abnormal network entity and the initial abnormal network entity have a fault propagation relation, the time interval between the fault occurrence time of the target nearest neighbor abnormal network entity and the fault occurrence time of the initial abnormal network entity is less than the target duration, and the fault propagation probability corresponding to the fault propagation relation is greater than the target probability threshold.
Optionally, when there is a normal network entity between the target nearest-neighbor abnormal network entity and the initial abnormal network entity, satisfying the fault propagation condition between the target nearest-neighbor abnormal network entity and the initial abnormal network entity includes:
the target nearest neighbor abnormal network entity and the normal network entity have a first fault propagation relation, the initial abnormal network entity and the normal network entity have a second fault propagation relation, the time interval between the fault occurrence time of the target nearest neighbor abnormal network entity and the fault occurrence time of the initial abnormal network entity is less than the target duration, and the fault propagation probability corresponding to the first fault propagation relation and the fault propagation probability corresponding to the second fault propagation relation are both greater than the target probability threshold.
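The two cases of the fault propagation condition can be folded into one check, sketched below in Python. The function name `condition_met`, the use of plain numbers for fault occurrence times, and the convention of passing one probability for adjacent entities and two probabilities when a normal network entity lies in between are assumptions for illustration only.

```python
from typing import Sequence


def condition_met(time_initial: float, time_candidate: float,
                  relation_probs: Sequence[float],
                  target_duration: float, probability_threshold: float) -> bool:
    """Check the fault propagation condition between two abnormal entities.

    relation_probs holds one probability when the entities are adjacent,
    two when a normal entity lies between them (first and second relation),
    and is empty when no fault propagation relation exists.
    """
    if len(relation_probs) not in (1, 2):
        return False
    within_duration = abs(time_initial - time_candidate) < target_duration
    above_threshold = all(p > probability_threshold for p in relation_probs)
    return within_duration and above_threshold


# Hypothetical values: times in seconds, target duration 60 s, threshold 0.5.
adjacent_ok = condition_met(100.0, 130.0, [0.8], 60.0, 0.5)
via_normal_low_prob = condition_met(100.0, 130.0, [0.8, 0.4], 60.0, 0.5)
too_far_apart = condition_met(100.0, 200.0, [0.8], 60.0, 0.5)
no_relation = condition_met(100.0, 130.0, [], 60.0, 0.5)
```

A check of this shape could serve as the condition evaluated for each target nearest-neighbor abnormal network entity during the target matching process.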
Optionally, as shown in fig. 14, the apparatus 130 further includes:
a second obtaining module 1304, configured to obtain, after the first knowledge graph of the failed target network is obtained, m second knowledge graphs of the target network corresponding to m times, where the m times are in one-to-one correspondence with the m second knowledge graphs, each of the m times is different from the generation time of the first knowledge graph, and m is a positive integer;
and a completion module 1305, configured to complete the network entity connection relations in the first knowledge graph according to the m second knowledge graphs.
Optionally, the completion module is configured to: execute a connection relation completion process for each abnormal network entity in the first knowledge graph, where the connection relation completion process includes:
acquiring a first sub-map from the first knowledge graph, where the first sub-map includes an abnormal network entity and all network entities having connection relations with the abnormal network entity, and the connection relations include direct connection relations and/or indirect connection relations; acquiring a second sub-map from each second knowledge graph according to the identifier of the abnormal network entity to obtain m second sub-maps, where each second sub-map includes a target network entity and all network entities having connection relations with the target network entity, and the identifier of the target network entity is the same as the identifier of the abnormal network entity; obtaining a target connection relation based on the first sub-map and the m second sub-maps, where the target connection relation satisfies the following conditions: the first sub-map does not include the target connection relation, and one or more of the m second sub-maps include the target connection relation; and when the confidence of the target connection relation is greater than a confidence threshold, adding the target connection relation to the first sub-map, where the confidence of the target connection relation is positively correlated with the number of occurrences of the target connection relation in the m second sub-maps.
Optionally, the number of occurrences of the target connection relation in the m second sub-maps is c, the confidence of the target connection relation is equal to c/m, and c is a positive integer.
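The completion rule with confidence c/m can be sketched in Python as follows. Representing each sub-map by a set of connection relations, modelling a connection relation as a pair of entity identifiers, and the names `complete_connections`, `R1`, `if1`, `if2`, and `if3` are hypothetical choices for illustration.

```python
from collections import Counter
from typing import List, Set, Tuple

Connection = Tuple[str, str]  # assumption: a connection relation as a pair of entity identifiers


def complete_connections(first: Set[Connection], seconds: List[Set[Connection]],
                         confidence_threshold: float) -> Set[Connection]:
    """Add to the first sub-map every missing connection whose confidence c/m exceeds the threshold."""
    m = len(seconds)
    if m == 0:
        return set(first)
    counts: Counter = Counter()
    for second in seconds:
        counts.update(second - first)  # only connections missing from the first sub-map
    added = {conn for conn, c in counts.items() if c / m > confidence_threshold}
    return set(first) | added


first = {("R1", "if1")}
seconds = [
    {("R1", "if1"), ("R1", "if2")},
    {("R1", "if1"), ("R1", "if2")},
    {("R1", "if1"), ("R1", "if2")},
    {("R1", "if1"), ("R1", "if3")},
]
completed = complete_connections(first, seconds, confidence_threshold=0.5)
```

Here m is 4. The connection (R1, if2) appears in 3 second sub-maps, so its confidence is 0.75 and it is added; (R1, if3) appears once, so its confidence is 0.25 and it is not.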
Optionally, the first determining module is configured to:
for each abnormal sub-map, calculating the out-degree of each abnormal network entity in the abnormal sub-map; and determining the abnormal network entity with the out degree of 0 as the root fault network entity in the abnormal sub-map.
Optionally, as shown in fig. 15, the apparatus 130 further includes:
a second determining module 1306, configured to, after determining the root cause failure network entity in the abnormal sub-map, determine, for each root cause failure network entity in the abnormal sub-map, a probability that the root cause failure network entity is a failure root cause based on a target path where the root cause failure network entity is located, where the target path is a path where the root cause failure network entity is a tail node;
an output module 1307, configured to output the fault root cause of the target network, where the fault root cause includes the fault results respectively corresponding to the n abnormal sub-maps, and each fault result includes each root cause failure network entity in the corresponding abnormal sub-map and the probability that each root cause failure network entity is the fault root cause.
Optionally, the second determining module is configured to:
determining the fault propagation probability corresponding to the target path on which the root cause failure network entity is located; when the number of target paths on which the root cause failure network entity is located is equal to 1, taking the fault propagation probability corresponding to the target path as the probability that the root cause failure network entity is the fault root cause; and when the number of target paths on which the root cause failure network entity is located is greater than 1, taking the fault propagation probability corresponding to a specified target path as the probability that the root cause failure network entity is the fault root cause, where the specified target path is the target path with the smallest fault propagation probability among all the target paths on which the root cause failure network entity is located.
Optionally, the second determining module is further configured to:
acquiring all fault propagation relations between network entities on a target path; and determining the fault propagation probability corresponding to the target fault propagation relation as the fault propagation probability corresponding to the target path, wherein the target fault propagation relation is the fault propagation relation with the minimum fault propagation probability in all the fault propagation relations.
Optionally, the first obtaining module is configured to:
when the target network fails, acquiring an abnormal event generated in the target network; and identifying, on an initial knowledge graph of the target network, the abnormal network entity that generated the abnormal event, to obtain the first knowledge graph, where the initial knowledge graph is generated based on network data of the target network, the network data includes the networking topology of the target network and device information of a plurality of network devices in the target network, and the device information includes one or more of interface configuration information, protocol configuration information, and service configuration information.
Optionally, as shown in fig. 16, the apparatus 130 further includes:
a third obtaining module 1308, configured to obtain network data of a target network when the target network fails;
an extracting module 1309, configured to extract a plurality of knowledge-graph triples from the network data, where each knowledge-graph triplet includes two network entities and a relationship between the two network entities;
a second generation module 1310 configured to generate an initial knowledge-graph based on the plurality of knowledge-graph triples.
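Building an initial knowledge graph from knowledge-graph triples can be sketched in Python as below. The triple layout `(head entity, relation, tail entity)`, the adjacency-dictionary output, and all entity and relation names in the sample data are hypothetical; the application only states that each triple contains two network entities and the relation between them.

```python
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # assumption: (head entity, relation, tail entity)


def build_knowledge_graph(triples: Iterable[Triple]) -> Dict[str, List[Tuple[str, str]]]:
    """Map every entity to the list of (relation, neighbouring entity) pairs leaving it."""
    graph: Dict[str, List[Tuple[str, str]]] = {}
    for head, relation, tail in triples:
        graph.setdefault(head, []).append((relation, tail))
        graph.setdefault(tail, [])  # make sure entities without outgoing relations appear too
    return graph


# Hypothetical triples extracted from network data.
triples = [
    ("router-1", "has_interface", "router-1/ge0"),
    ("router-1/ge0", "runs_protocol", "ospf-1"),
    ("ospf-1", "carries_service", "vpn-a"),
]
initial_graph = build_knowledge_graph(triples)
```

The resulting dictionary contains one key per network entity, which is a convenient starting point for marking abnormal network entities afterwards.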
Optionally, the abnormal event carries the fault occurrence time of the abnormal network entity that generated the abnormal event.
Optionally, the abnormal event includes one or more of an alarm log, a status change log, and an abnormal key performance indicator.
In summary, in the fault root cause locating device provided by the embodiments of the present application, the knowledge graph is generated based on the entire network, and the fault propagation conditions corresponding to the knowledge graph are likewise based on the entire network. Fault propagation between devices can therefore be taken into account when the knowledge graph is used to locate fault root causes in the network, which improves the accuracy of fault root cause locating. In addition, the knowledge graph is divided into n abnormal sub-maps such that the fault propagation conditions are satisfied between the abnormal network entities in each abnormal sub-map, which groups the faults in the target network. The management device can subsequently locate fault root causes separately on the basis of each abnormal sub-map, which reduces the scale of the knowledge graph to be processed and can effectively improve the efficiency of fault root cause locating.
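The division of abnormal network entities into abnormal sub-maps can be sketched as below. The sketch is simplified and rests on assumptions: it only handles abnormal entities that are directly adjacent (it does not model a normal network entity lying between two abnormal ones), fault occurrence times are plain numbers in seconds, and the names `propagation_ok`, `group_abnormal_entities`, `max_gap` and `min_prob` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def propagation_ok(a: str, b: str, prob: Dict[Edge, float],
                   fault_time: Dict[str, float],
                   max_gap: float, min_prob: float) -> bool:
    """Fault propagation condition for two adjacent abnormal entities:
    a propagation relation exists, the fault times are close enough,
    and the propagation probability exceeds the threshold."""
    p = prob.get((a, b), prob.get((b, a)))
    if p is None:  # no fault propagation relation between a and b
        return False
    return abs(fault_time[a] - fault_time[b]) < max_gap and p > min_prob


def group_abnormal_entities(abnormal: Set[str], prob: Dict[Edge, float],
                            fault_time: Dict[str, float],
                            max_gap: float, min_prob: float) -> List[Set[str]]:
    """Repeat the sub-map generation process until no entity is left."""
    remaining = set(abnormal)
    groups = []
    while remaining:
        start = min(remaining)          # deterministic choice of a start entity
        group, stack = {start}, [start]
        while stack:                    # matching process, run iteratively
            current = stack.pop()
            for other in sorted(remaining - group):
                if propagation_ok(current, other, prob, fault_time,
                                  max_gap, min_prob):
                    group.add(other)
                    stack.append(other)  # becomes a new start entity
        remaining -= group              # each entity joins exactly one group
        groups.append(group)
    return groups


prob = {("A", "B"): 0.9, ("B", "C"): 0.8, ("C", "D"): 0.2}
fault_time = {"A": 0.0, "B": 5.0, "C": 8.0, "D": 9.0}
groups = group_abnormal_entities({"A", "B", "C", "D"}, prob, fault_time,
                                 max_gap=10.0, min_prob=0.5)
```

In this hypothetical input, entity D is adjacent to C but the propagation probability 0.2 is below the threshold, so D ends up in a sub-map of its own.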
In addition, in the embodiments of the present application, connection relations that are missing from the current knowledge graph are determined by sub-map comparison, and connection relations with high confidence are filled into the current knowledge graph. This addresses the problem that connection relations lost from the knowledge graph because of a network fault make the tracing of the fault root cause inaccurate, and further improves the accuracy of fault root cause locating.
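The completion of missing connection relations can be sketched as follows, using the confidence c/m, where c is the number of the m second sub-maps that contain a connection relation absent from the first sub-map. The representation of a sub-map as a set of entity pairs and the names `complete_connections` and `threshold` are assumptions made for illustration only.

```python
from collections import Counter
from typing import List, Set, Tuple

Edge = Tuple[str, str]


def complete_connections(first: Set[Edge], seconds: List[Set[Edge]],
                         threshold: float) -> Set[Edge]:
    """Return the first sub-map with high-confidence missing edges added."""
    m = len(seconds)
    if m == 0:
        return set(first)
    # c: in how many of the m second sub-maps each missing edge appears.
    counts = Counter(e for sub in seconds for e in sub if e not in first)
    added = {e for e, c in counts.items() if c / m > threshold}
    return set(first) | added


first = {("A", "B")}
seconds = [
    {("A", "B"), ("B", "C")},
    {("B", "C")},
    {("B", "C"), ("C", "D")},
    {("A", "B")},
]
completed = complete_connections(first, seconds, threshold=0.5)
```

Here the relation (B, C) appears in 3 of 4 second sub-maps, giving a confidence of 0.75, and is added. The relation (C, D) appears only once, giving 0.25, and is left out.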
Fig. 17 is a block diagram of a fault root cause locating device according to an embodiment of the present application. The fault root cause locating device may be the management device in the application scenario shown in fig. 1. As shown in fig. 17, the management device 170 includes: a processor 1701 and a memory 1702.
A memory 1702 for storing a computer program, the computer program comprising program instructions;
the processor 1701 is configured to invoke a computer program to implement the fault root cause locating method shown in fig. 2.
Optionally, the management device 170 further includes a communication bus 1703 and a communication interface 1704.
The processor 1701 includes one or more processing cores. The processor 1701 executes various functional applications and performs fault root cause locating by running the computer program.
The memory 1702 may be used to store the computer program. Optionally, the memory may store an operating system and an application program unit required for at least one function. The operating system may be an operating system such as Real Time eXecutive (RTX), LINUX, UNIX, WINDOWS, or OS X.
There may be multiple communication interfaces 1704. The communication interface 1704 is used to communicate with other devices, for example, with a control device or a network device.
The memory 1702 and the communication interface 1704 are connected to the processor 1701 via a communication bus 1703, respectively.
An embodiment of the present application further provides a computer storage medium, where instructions are stored on the computer storage medium, and when the instructions are executed by a processor, the method for locating a fault root cause as shown in fig. 2 is implemented.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
In the embodiments of the present application, the terms "first", "second", and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. The term "at least one" means one or more, and the term "plurality" means two or more, unless expressly defined otherwise.
The term "and/or" in this application merely describes an association relationship between associated objects and indicates that three relationships may exist. For example, A and/or B may mean: A exists alone, both A and B exist, or B exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.
The above description is only exemplary of the present application and is not intended to limit the present application, and any modifications, equivalents, improvements, etc. made within the spirit and principles of the present application are intended to be included within the scope of the present application.

Claims (34)

1. A method for locating a fault root cause, the method comprising:
acquiring a first knowledge graph of a target network with a fault, wherein an abnormal network entity generating an abnormal event in the target network is identified on the first knowledge graph, and the type of the network entity on the first knowledge graph is network equipment, an interface, a protocol or a service;
generating n abnormal sub-maps based on the first knowledge-map, wherein each abnormal sub-map comprises one or more abnormal network entities, when the abnormal sub-map comprises a plurality of abnormal network entities, fault propagation conditions are met between any abnormal network entity in the abnormal sub-map and one or more other abnormal network entities in the abnormal sub-map, the n abnormal sub-maps comprise all abnormal network entities on the first knowledge-map, any abnormal network entity only belongs to one abnormal sub-map, and n is a positive integer;
for one or more of the n abnormal sub-maps, determining a root cause failure network entity in the abnormal sub-map.
2. The method of claim 1, wherein the fault propagation conditions include one or more of a fault propagation relationship, a fault propagation time condition, and a fault propagation probability condition.
3. The method according to claim 1 or 2, wherein the generating n abnormal sub-maps based on the first knowledge-map comprises:
acquiring an abnormal network entity set, wherein the abnormal network entity set comprises all abnormal network entities on the first knowledge graph;
repeatedly executing a sub-map generation process until the abnormal network entity set is an empty set to obtain the n abnormal sub-maps, wherein the sub-map generation process comprises the following steps:
selecting an initial abnormal network entity from the abnormal network entity set;
executing a target matching process on the initial abnormal network entity to obtain an abnormal sub-map comprising the initial abnormal network entity;
deleting all abnormal network entities in the abnormal sub-map from the abnormal network entity set to obtain an updated abnormal network entity set;
wherein the target matching process comprises:
based on the first knowledge graph, acquiring all target nearest neighbor abnormal network entities of the initial abnormal network entity, wherein no other abnormal network entity exists between the target nearest neighbor abnormal network entity and the initial abnormal network entity, and the target nearest neighbor abnormal network entity is not located in the abnormal sub-map in which the initial abnormal network entity is located,
for each of the target nearest neighbor abnormal network entities:
when the target nearest neighbor abnormal network entity and the initial abnormal network entity meet the fault propagation condition, adding the target nearest neighbor abnormal network entity into the abnormal sub-map where the initial abnormal network entity is located, taking the target nearest neighbor abnormal network entity as a new initial abnormal network entity, and executing the target matching process again,
and when the fault propagation condition is not satisfied between the target nearest neighbor abnormal network entity and the initial abnormal network entity, determining that the target nearest neighbor abnormal network entity does not belong to the abnormal sub-map where the initial abnormal network entity is located.
4. The method according to claim 3, wherein when the target nearest neighbor abnormal network entity and the initial abnormal network entity are two adjacent network entities, that the fault propagation condition is satisfied between the target nearest neighbor abnormal network entity and the initial abnormal network entity comprises:
the target nearest neighbor abnormal network entity and the initial abnormal network entity have a fault propagation relation, the time interval between the fault occurrence time of the target nearest neighbor abnormal network entity and the fault occurrence time of the initial abnormal network entity is less than a target duration, and the fault propagation probability corresponding to the fault propagation relation is greater than a target probability threshold.
5. The method according to claim 3, wherein when there is a normal network entity between the target nearest neighbor abnormal network entity and the initial abnormal network entity, that the fault propagation condition is satisfied between the target nearest neighbor abnormal network entity and the initial abnormal network entity comprises:
the target nearest neighbor abnormal network entity and the normal network entity have a first fault propagation relation, the initial abnormal network entity and the normal network entity have a second fault propagation relation, the time interval between the fault occurrence time of the target nearest neighbor abnormal network entity and the fault occurrence time of the initial abnormal network entity is less than the target duration, and the fault propagation probability corresponding to the first fault propagation relation and the fault propagation probability corresponding to the second fault propagation relation are both greater than the target probability threshold.
6. The method of claim 1 or 2, wherein after the obtaining the first knowledge-graph of the failed target network, the method further comprises:
acquiring m second knowledge graphs of the target network corresponding to m moments, wherein the m moments correspond to the m second knowledge graphs one by one, the m moments are different from the generation moments of the first knowledge graphs, and m is a positive integer;
and completing the network entity connection relation in the first knowledge graph according to the m second knowledge graphs.
7. The method of claim 6, wherein the completing the network entity connection relation in the first knowledge graph according to the m second knowledge graphs comprises:
executing a connection relation completion process for each abnormal network entity in the first knowledge graph respectively, wherein the connection relation completion process comprises the following steps:
acquiring a first sub-map from the first knowledge map, wherein the first sub-map comprises the abnormal network entity and all network entities having connection relations with the abnormal network entity, and the connection relations comprise direct connection relations and/or indirect connection relations;
according to the identification of the abnormal network entity, respectively obtaining a second sub-map from each second knowledge map to obtain m second sub-maps, wherein the second sub-maps comprise a target network entity and all network entities having the connection relation with the target network entity, and the identification of the target network entity is the same as the identification of the abnormal network entity;
obtaining a target connection relation based on the first sub-map and the m second sub-maps, wherein the target connection relation satisfies the following conditions: the target connection relation is not included in the first sub-map, and one or more second sub-maps in the m second sub-maps include the target connection relation;
when the confidence of the target connection relation is larger than a confidence threshold value, adding the target connection relation into the first sub-map, wherein the confidence of the target connection relation is positively correlated with the occurrence times of the target connection relation in the m second sub-maps.
8. The method according to claim 7, wherein the number of occurrences of the target connection relation in the m second sub-maps is c, the confidence of the target connection relation is equal to c/m, and c is a positive integer.
9. The method of claim 1 or 2, wherein the determining a root cause failure network entity in the abnormal sub-map comprises:
calculating the out degree of each abnormal network entity in the abnormal sub-map;
and determining the abnormal network entity with the out degree of 0 as the root cause failure network entity in the abnormal sub-map.
10. The method of claim 1 or 2, wherein after the determining a root cause failure network entity in the abnormal sub-map, the method further comprises:
for each root cause failure network entity in the abnormal sub-map, determining the probability that the root cause failure network entity is a failure root cause based on a target path where the root cause failure network entity is located, wherein the target path is a path taking the root cause failure network entity as a tail node;
outputting a fault root cause of the target network, where the fault root cause includes fault results corresponding to the n abnormal sub-maps, and the fault results include each root cause failure network entity in the abnormal sub-map and a probability that each root cause failure network entity is a failure root cause.
11. The method of claim 10, wherein the determining the probability that the root cause failure network entity is a failure root cause based on the target path where the root cause failure network entity is located comprises:
determining the fault propagation probability corresponding to the target path where the root cause failure network entity is located;
when the number of the target paths where the root cause failure network entity is located is equal to 1, taking the fault propagation probability corresponding to the target path as the probability that the root cause failure network entity is a failure root cause;
and when the number of the target paths where the root cause failure network entity is located is greater than 1, taking the fault propagation probability corresponding to a specified target path as the probability that the root cause failure network entity is a failure root cause, wherein the specified target path is the target path with the minimum fault propagation probability in all the target paths where the root cause failure network entity is located.
12. The method of claim 11, wherein the determining the fault propagation probability corresponding to the target path where the root cause failure network entity is located comprises:
acquiring all fault propagation relations between network entities on the target path;
and determining the fault propagation probability corresponding to the target fault propagation relation as the fault propagation probability corresponding to the target path, wherein the target fault propagation relation is the fault propagation relation with the minimum corresponding fault propagation probability in all the fault propagation relations.
13. The method of claim 1 or 2, wherein obtaining the first knowledge-graph of the failed target network comprises:
when the target network fails, acquiring an abnormal event generated in the target network;
identifying an abnormal network entity which generates the abnormal event in the target network on an initial knowledge graph of the target network to obtain the first knowledge graph, wherein the initial knowledge graph is generated based on network data of the target network, the network data comprises networking topology of the target network and equipment information of a plurality of network equipment in the target network, and the equipment information comprises one or more of interface configuration information, protocol configuration information and service configuration information.
14. The method of claim 13, further comprising:
when the target network fails, acquiring network data of the target network;
extracting a plurality of knowledge-graph triplets from the network data, each knowledge-graph triplet including two network entities and a relationship between the two network entities;
and generating the initial knowledge graph according to the plurality of knowledge graph triples.
15. The method according to claim 13, wherein the abnormal event carries a fault occurrence time of the abnormal network entity that generated the abnormal event.
16. The method of claim 1 or 2, wherein the abnormal events include one or more of an alarm log, a status change log, and an abnormal key performance indicator.
17. A fault root cause locating device, the device comprising:
a first obtaining module, configured to acquire a first knowledge graph of a target network with a fault, wherein an abnormal network entity that generates an abnormal event in the target network is identified on the first knowledge graph, and the type of the network entity on the first knowledge graph is network equipment, an interface, a protocol or a service;
a first generation module, configured to generate n abnormal sub-maps based on the first knowledge-map, where each abnormal sub-map includes one or more abnormal network entities, and when an abnormal sub-map includes multiple abnormal network entities, a fault propagation condition is satisfied between any abnormal network entity in the abnormal sub-map and one or more other abnormal network entities in the abnormal sub-map, where the n abnormal sub-maps include all abnormal network entities on the first knowledge-map, and any abnormal network entity belongs to only one abnormal sub-map, and n is a positive integer;
a first determining module, configured to determine, for one or more abnormal sub-maps of the n abnormal sub-maps, a root cause failure network entity in the abnormal sub-map.
18. The apparatus of claim 17, wherein the fault propagation conditions comprise one or more of a fault propagation relationship, a fault propagation time condition, and a fault propagation probability condition.
19. The apparatus of claim 17 or 18, wherein the first generating module is configured to:
acquiring an abnormal network entity set, wherein the abnormal network entity set comprises all abnormal network entities on the first knowledge graph;
repeatedly executing a sub-map generation process until the abnormal network entity set is an empty set to obtain the n abnormal sub-maps, wherein the sub-map generation process comprises the following steps:
selecting an initial abnormal network entity from the abnormal network entity set;
executing a target matching process on the initial abnormal network entity to obtain an abnormal sub-map comprising the initial abnormal network entity;
deleting all abnormal network entities in the abnormal sub-map from the abnormal network entity set to obtain an updated abnormal network entity set;
wherein the target matching process comprises:
based on the first knowledge graph, acquiring all target nearest neighbor abnormal network entities of the initial abnormal network entity, wherein no other abnormal network entity exists between the target nearest neighbor abnormal network entity and the initial abnormal network entity, and the target nearest neighbor abnormal network entity is not located in the abnormal sub-map in which the initial abnormal network entity is located,
for each of the target nearest neighbor abnormal network entities:
when the target nearest neighbor abnormal network entity and the initial abnormal network entity meet the fault propagation condition, adding the target nearest neighbor abnormal network entity into the abnormal sub-map where the initial abnormal network entity is located, taking the target nearest neighbor abnormal network entity as a new initial abnormal network entity, and executing the target matching process again,
and when the fault propagation condition is not satisfied between the target nearest neighbor abnormal network entity and the initial abnormal network entity, determining that the target nearest neighbor abnormal network entity does not belong to the abnormal sub-map where the initial abnormal network entity is located.
20. The apparatus of claim 19, wherein when the target nearest neighbor abnormal network entity and the initial abnormal network entity are two adjacent network entities, that the fault propagation condition is satisfied between the target nearest neighbor abnormal network entity and the initial abnormal network entity comprises:
the target nearest neighbor abnormal network entity and the initial abnormal network entity have a fault propagation relation, the time interval between the fault occurrence time of the target nearest neighbor abnormal network entity and the fault occurrence time of the initial abnormal network entity is less than a target duration, and the fault propagation probability corresponding to the fault propagation relation is greater than a target probability threshold.
21. The apparatus of claim 19, wherein when there is a normal network entity between the target nearest neighbor abnormal network entity and the initial abnormal network entity, that the fault propagation condition is satisfied between the target nearest neighbor abnormal network entity and the initial abnormal network entity comprises:
the target nearest neighbor abnormal network entity and the normal network entity have a first fault propagation relation, the initial abnormal network entity and the normal network entity have a second fault propagation relation, the time interval between the fault occurrence time of the target nearest neighbor abnormal network entity and the fault occurrence time of the initial abnormal network entity is less than the target duration, and the fault propagation probability corresponding to the first fault propagation relation and the fault propagation probability corresponding to the second fault propagation relation are both greater than the target probability threshold.
22. The apparatus of claim 17 or 18, further comprising:
a second obtaining module, configured to obtain m second knowledge maps of the target network corresponding to m times after obtaining the first knowledge map of the failed target network, where the m times are in one-to-one correspondence with the m second knowledge maps, the m times are different from generation times of the first knowledge map, and m is a positive integer;
and a completion module, configured to complete the network entity connection relation in the first knowledge graph according to the m second knowledge graphs.
23. The apparatus of claim 22, wherein the completion module is configured to:
executing a connection relation completion process for each abnormal network entity in the first knowledge graph respectively, wherein the connection relation completion process comprises the following steps:
acquiring a first sub-map from the first knowledge map, wherein the first sub-map comprises the abnormal network entity and all network entities having connection relations with the abnormal network entity, and the connection relations comprise direct connection relations and/or indirect connection relations;
according to the identification of the abnormal network entity, respectively obtaining a second sub-map from each second knowledge map to obtain m second sub-maps, wherein the second sub-maps comprise a target network entity and all network entities having the connection relation with the target network entity, and the identification of the target network entity is the same as the identification of the abnormal network entity;
obtaining a target connection relation based on the first sub-map and the m second sub-maps, wherein the target connection relation satisfies the following conditions: the target connection relation is not included in the first sub-map, and one or more second sub-maps in the m second sub-maps include the target connection relation;
when the confidence of the target connection relation is larger than a confidence threshold value, adding the target connection relation into the first sub-map, wherein the confidence of the target connection relation is positively correlated with the occurrence times of the target connection relation in the m second sub-maps.
24. The apparatus according to claim 23, wherein the number of occurrences of the target connection relation in the m second sub-maps is c, the confidence of the target connection relation is equal to c/m, and c is a positive integer.
25. The apparatus of claim 17 or 18, wherein the first determining module is configured to:
calculating the out degree of each abnormal network entity in the abnormal sub-map;
and determining the abnormal network entity with the out degree of 0 as the root cause failure network entity in the abnormal sub-map.
26. The apparatus of claim 17 or 18, further comprising:
a second determining module, configured to determine, for each root cause failure network entity in the abnormal sub-map, a probability that the root cause failure network entity is a failure root cause based on a target path where the root cause failure network entity is located, where the target path is a path where the root cause failure network entity is a tail node, after determining the root cause failure network entity in the abnormal sub-map;
an output module, configured to output a fault root cause of the target network, where the fault root cause includes fault results corresponding to the n abnormal sub-maps, and the fault results include each root cause failure network entity in the abnormal sub-map and a probability that each root cause failure network entity is a failure root cause.
27. The apparatus of claim 26, wherein the second determining module is configured to:
determining the fault propagation probability corresponding to the target path where the root cause failure network entity is located;
when the number of the target paths where the root cause failure network entity is located is equal to 1, taking the fault propagation probability corresponding to the target path as the probability that the root cause failure network entity is a failure root cause;
and when the number of the target paths where the root cause failure network entity is located is greater than 1, taking the fault propagation probability corresponding to a specified target path as the probability that the root cause failure network entity is a failure root cause, wherein the specified target path is the target path with the minimum fault propagation probability in all the target paths where the root cause failure network entity is located.
28. The apparatus of claim 27, wherein the second determining module is further configured to:
acquiring all fault propagation relations between network entities on the target path;
and determining the fault propagation probability corresponding to the target fault propagation relation as the fault propagation probability corresponding to the target path, wherein the target fault propagation relation is the fault propagation relation with the minimum corresponding fault propagation probability in all the fault propagation relations.
29. The apparatus of claim 17 or 18, wherein the first obtaining module is configured to:
when the target network fails, acquiring an abnormal event generated in the target network;
identifying an abnormal network entity which generates the abnormal event in the target network on an initial knowledge graph of the target network to obtain the first knowledge graph, wherein the initial knowledge graph is generated based on network data of the target network, the network data comprises networking topology of the target network and equipment information of a plurality of network equipment in the target network, and the equipment information comprises one or more of interface configuration information, protocol configuration information and service configuration information.
30. The apparatus of claim 29, further comprising:
a third obtaining module, configured to acquire the network data of the target network when the target network fails;
an extraction module configured to extract a plurality of knowledge-graph triples from the network data, each of the knowledge-graph triples including two network entities and a relationship between the two network entities;
a second generating module configured to generate the initial knowledge-graph according to the plurality of knowledge-graph triples.
31. The apparatus according to claim 29, wherein the abnormal event carries a fault occurrence time of the abnormal network entity that generated the abnormal event.
32. The apparatus of claim 17 or 18, wherein the abnormal events comprise one or more of an alarm log, a status change log, and an abnormal key performance indicator.
33. A fault root cause locating device, comprising: a processor and a memory;
the memory for storing a computer program, the computer program comprising program instructions;
the processor is configured to invoke the computer program to implement the fault root cause location method according to any one of claims 1 to 16.
34. A computer storage medium having stored thereon instructions which, when executed by a processor, implement the fault root cause locating method according to any one of claims 1 to 16.
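The out-degree rule recited in claims 9 and 25 can be illustrated with a minimal sketch. It assumes that each fault propagation relation inside an abnormal sub-map is a directed pair (source, destination) pointing toward the entity whose fault propagated to the source, which is consistent with the root cause entity being the tail node of a target path. That direction convention and the name `root_cause_entities` are assumptions made for illustration.

```python
from typing import Iterable, List, Tuple

Edge = Tuple[str, str]


def root_cause_entities(entities: Iterable[str],
                        edges: Iterable[Edge]) -> List[str]:
    """Return the abnormal entities whose out-degree in the sub-map is 0."""
    out_degree = {e: 0 for e in entities}
    for src, dst in edges:
        # Only count relations whose both ends belong to this sub-map.
        if src in out_degree and dst in out_degree:
            out_degree[src] += 1
    return sorted(e for e, d in out_degree.items() if d == 0)


# Hypothetical sub-map: a service fault traced through BGP to an interface.
roots = root_cause_entities(
    ["svc", "bgp", "if1"],
    [("svc", "bgp"), ("bgp", "if1")],
)
```

A sub-map that contains a single abnormal network entity has no relations, so that entity has an out-degree of 0 and is returned as the root cause entity.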
CN201911096747.0A 2019-11-11 2019-11-11 Fault root cause positioning method and device and computer storage medium Active CN112787841B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911096747.0A CN112787841B (en) 2019-11-11 2019-11-11 Fault root cause positioning method and device and computer storage medium

Publications (2)

Publication Number Publication Date
CN112787841A CN112787841A (en) 2021-05-11
CN112787841B true CN112787841B (en) 2022-04-05

Family

ID=75749289

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911096747.0A Active CN112787841B (en) 2019-11-11 2019-11-11 Fault root cause positioning method and device and computer storage medium

Country Status (1)

Country Link
CN (1) CN112787841B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113328872B (en) 2020-02-29 2023-03-28 华为技术有限公司 Fault repairing method, device and storage medium
CN113032238B (en) * 2021-05-25 2021-08-17 南昌惠联网络技术有限公司 Real-time root cause analysis method based on application knowledge graph
CN113098723B (en) * 2021-06-07 2021-09-17 新华三人工智能科技有限公司 Fault root cause positioning method and device, storage medium and equipment
CN113595994B (en) * 2021-07-12 2023-03-21 深信服科技股份有限公司 Abnormal mail detection method and device, electronic equipment and storage medium
CN113434326A (en) * 2021-07-12 2021-09-24 国泰君安证券股份有限公司 Method and device for realizing network system fault positioning based on distributed cluster topology, processor and computer readable storage medium thereof
CN114422325A (en) * 2021-12-30 2022-04-29 优刻得科技股份有限公司 Content distribution network abnormity positioning method, device, equipment and storage medium
CN114430365B (en) * 2022-04-06 2022-07-29 北京宝兰德软件股份有限公司 Fault root cause analysis method, device, electronic equipment and storage medium
CN114978877B (en) * 2022-05-13 2024-04-05 京东科技信息技术有限公司 Abnormality processing method, abnormality processing device, electronic equipment and computer readable medium
CN115277453A (en) * 2022-06-13 2022-11-01 北京宝兰德软件股份有限公司 Method for generating abnormal knowledge graph in operation and maintenance field, application method and device
CN115756929B (en) * 2022-11-23 2023-06-02 北京大学 Abnormal root cause positioning method and system based on dynamic service dependency graph
CN116467468B (en) * 2023-05-05 2024-01-05 国网浙江省电力有限公司 Power management system abnormal information handling method based on knowledge graph technology

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109522192A (en) * 2018-10-17 2019-03-26 北京航空航天大学 A kind of prediction technique of knowledge based map and complex network combination
CN109992440A (en) * 2019-04-02 2019-07-09 北京睿至大数据有限公司 A kind of IT root accident analysis recognition methods of knowledge based map and machine learning
CN110008288A (en) * 2019-02-19 2019-07-12 武汉烽火技术服务有限公司 The construction method in the knowledge mapping library for Analysis of Network Malfunction and its application

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190286504A1 (en) * 2018-03-15 2019-09-19 Ca, Inc. Graph-based root cause analysis

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
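Research on Knowledge Graph Construction Technology for Fault Analysis; Liu Xin; China Master's Theses Full-text Database (Information Science and Technology); 2019-08-15; full text *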

Also Published As

Publication number Publication date
CN112787841A (en) 2021-05-11

Similar Documents

Publication Publication Date Title
CN112787841B (en) Fault root cause positioning method and device and computer storage medium
CN112887119B (en) Fault root cause determination method and device and computer storage medium
WO2022083540A1 (en) Method, apparatus, and system for determining fault recovery plan, and computer storage medium
CN112491636B (en) Data processing method and device and computer storage medium
US20110093579A1 (en) Apparatus and system for estimating network configuration
CN111404822B (en) Data transmission method, device, equipment and computer readable storage medium
US7808888B2 (en) Network fault correlation in multi-route configuration scenarios
CN113225194B (en) Routing abnormity detection method, device and system and computer storage medium
US10764214B1 (en) Error source identification in cut-through networks
CN113852476A (en) Method, device and system for determining abnormal event associated object
CN113868367A (en) Method, device and system for constructing knowledge graph and computer storage medium
US9893979B2 (en) Network topology discovery by resolving loops
Lad et al. An algorithmic approach to identifying link failures
US20040158780A1 (en) Method and system for presenting neighbors of a device in a network via a graphical user interface
US10148515B2 (en) Determining connections of non-external network facing ports
CN113190368A (en) Method, device and system for realizing table item check and computer storage medium
US20160344571A1 (en) Determining Connections Between Disconnected Partial Trees
CN116248479A (en) Network path detection method, device, equipment and storage medium
US20220200860A1 (en) Mitigation of physical network misconfigurations for clustered nodes
CN113271216B (en) Data processing method and related equipment
US9158871B2 (en) Graph modeling systems and methods
CN112468400A (en) Fault positioning method, device, equipment and medium
Jones Vulnerability Analysis of the Physical and Logical Network Topology on the US Virgin Islands
CN116684262A (en) Method and device for acquiring fault propagation relationship
CN114519095A (en) Data processing method, device and system and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant