WO2021052380A1 - Method, apparatus, and storage medium for extracting fault propagation conditions - Google Patents

Method, apparatus, and storage medium for extracting fault propagation conditions

Info

Publication number
WO2021052380A1
Authority
WO
WIPO (PCT)
Prior art keywords
fault
fault propagation
sub
condition
propagation condition
Prior art date
Application number
PCT/CN2020/115701
Other languages
English (en)
French (fr)
Inventor
肖欣
谢于明
王仲宇
高云鹏
马凯
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to EP20865768.4A (patent EP4024765B1)
Publication of WO2021052380A1
Priority to US17/655,107 (patent US20220207383A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/022 Knowledge engineering; Knowledge acquisition
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H04L41/065 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis involving logical or physical relationship, e.g. grouping and hierarchies
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9024 Graphs; Linked lists
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/022 Knowledge engineering; Knowledge acquisition
    • G06N5/025 Extracting rules from data
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0677 Localisation of faults
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/16 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks using machine learning or artificial intelligence

Definitions

  • This application relates to the field of communication technology, and further relates to the application of artificial intelligence (AI) in the field of communication technology, and in particular to a method, device and storage medium for extracting fault propagation conditions.
  • the network device obtains multiple event object connection graphs of the communication network at multiple different times, where the multiple different times correspond one-to-one to the multiple event object connection graphs, and each event object connection graph describes the fault-related events that occur in the communication network and the connection relationships between the objects related to those events;
  • the network device determines multiple subgraphs according to the multiple event object connection graphs, where the multiple subgraphs correspond one-to-one to the multiple event object connection graphs, each subgraph is a subset of the corresponding event object connection graph, the number of hops in each subgraph between the object that generates a first event and any object related to the first event is not greater than N, the events include the first event, and N is an integer greater than or equal to 1;
  • the network device determines a fault propagation condition according to the multiple updated subgraphs, and the fault propagation condition is used to indicate a path through which the fault is propagated in the communication network.
  • the aforementioned multiple different times may refer to multiple different moments or multiple different time periods.
  • the aforementioned multiple different times may also include both moments and time periods. That is, the multiple event object connection graphs may all correspond to different moments, may all correspond to different time periods, or may be a mix in which some correspond to different moments and the others correspond to different time periods.
  • the network device clusters the plurality of updated subgraphs according to the similarity and the clustering algorithm to obtain the plurality of subgraph sets.
  • the network device determining the fault propagation condition according to the multiple updated subgraphs includes:
  • the method further includes:
  • the network device determines the alarm occurrence time of the starting point and the alarm occurrence time of the end point of a first fault propagation condition, where the first fault propagation condition is a fault propagation condition extracted from a first subgraph set, and the multiple subgraph sets include the first subgraph set;
  • the network device determines the difference between the alarm occurrence time at the starting point of the first fault propagation condition and the alarm occurrence time at the end point as the fault propagation time corresponding to the first fault propagation condition.
  • the network device selects, from the fault propagation conditions, a second fault propagation condition whose destination is the object of the current fault alarm and that can match the updated subgraph of the communication network at the current time;
  • the method further includes:
  • because each fault propagation condition corresponds to a probability and there is usually only one fault source, when more than one fault propagation condition meets the conditions, the fault propagation condition with the highest probability can be selected from them, and its starting point is determined as the fault source of the current fault alarm.
  • the fault propagation condition is extracted by the network device from the updated subgraph included in each of the multiple subgraph sets according to a frequent subgraph mining algorithm;
  • the network device determines the number of updated subgraphs in a first subgraph set in which a first fault propagation condition appears, where the first fault propagation condition is a fault propagation condition extracted from the first subgraph set, and the multiple subgraph sets include the first subgraph set;
  • the network device determines the probability of the occurrence of the first fault propagation condition according to the ratio between the number and the total number of updated subgraphs in the first subgraph set.
  • the network device determining the probability of occurrence of the fault propagation condition includes:
  • the network device determines the number of times the connection relationship between the starting point of the first fault propagation condition and a second event occurs in the multiple updated subgraphs to obtain a second number, where the events include the second event, and the second event is an event corresponding to the first fault propagation condition;
  • in the above process, the fault propagation conditions that meet the conditions are first determined according to the fault propagation time, and the fault source of the current fault alarm is then determined according to the probability.
  • the network device selects the second fault propagation condition whose end point is the object of the current fault alarm and can match the updated subgraph of the communication network at the current time. According to the updated sub-graph of the communication network at the current time, a third fault propagation condition whose starting point has a fault alarm before the current time is selected from the second fault propagation conditions. A fault propagation condition with a probability greater than a probability threshold is selected from the third fault propagation conditions, and the selected fault propagation condition is taken as a fault propagation condition that satisfies the condition. When the number of fault propagation conditions that meet the conditions is 1, the network device determines the starting point of the fault propagation conditions that meet the conditions as the fault source of the current fault alarm.
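The probability-based selection described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the application's implementation: the `Condition` record, the `locate_fault_source` function, the `Interface` object type, and the representation of earlier alarms as a set of object types are all hypothetical, and matching against the updated subgraph is reduced to comparing start and end object types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    """Hypothetical record for one extracted fault propagation condition."""
    start: str          # object type at the starting point
    end: str            # object type at the end point
    probability: float  # probability of occurrence


def locate_fault_source(conditions, alarm_object, earlier_alarms, threshold):
    """Return the starting point of the most probable qualifying condition."""
    candidates = [
        c for c in conditions
        if c.end == alarm_object        # end point is the alarming object
        and c.start in earlier_alarms   # starting point alarmed earlier
        and c.probability > threshold   # probability above the threshold
    ]
    if not candidates:
        return None
    # With several candidates, keep the one with the highest probability.
    return max(candidates, key=lambda c: c.probability).start


conditions = [
    Condition("OsNetwork", "BGP Peer", 0.8),
    Condition("Interface", "BGP Peer", 0.6),
    Condition("OsNetwork", "L3link", 0.9),
]
source = locate_fault_source(
    conditions, "BGP Peer", {"OsNetwork", "Interface"}, threshold=0.5
)
```

Here two conditions end at `BGP Peer` and pass the threshold, so the one with the higher probability decides the fault source.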
  • the network device determines, according to the updated subgraph of the communication network at the current time, the propagation time of the current alarm corresponding to each fault propagation condition that meets the conditions, and determines, as the fault source of the currently occurring fault alarm, the starting point of the fault propagation condition for which the difference between the propagation time of the current alarm and the fault propagation time is smallest.
  • the network device may also only determine the fault propagation time corresponding to the fault propagation condition, or only determine the probability of occurrence of the fault propagation condition. In this case, the network device may only determine the fault source of the current fault alarm based on the propagation time of the fault, or only determine the fault source of the currently occurring fault alarm based on the probability.
  • the network device determines the starting point of the fault propagation conditions that meet the conditions as the fault source of the current fault alarm.
  • the network device determines, as the fault source of the currently occurring fault alarm, the starting point of the fault propagation condition that, among the fault propagation conditions that meet the conditions, has the smallest difference between the propagation time of the current alarm and the fault propagation time.
  • the process by which the network device determines the fault source of the current fault alarm based only on the probability can be as follows: from the extracted fault propagation conditions, select a second fault propagation condition whose end point is the object of the current fault alarm and that can match the updated subgraph of the communication network at the current time.
  • a third fault propagation condition, at whose starting point a fault alarm occurred before the current time, is selected from the second fault propagation conditions.
  • the propagation time of this alarm corresponding to the third fault propagation condition is determined.
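The time-based selection can be illustrated with a short Python sketch. It assumes, hypothetically, that each candidate is a `(start, fault_propagation_time)` pair and that the alarm occurrence time of each starting point is already known; the function name and the `Interface` object type are illustrative only.

```python
def locate_by_propagation_time(candidates, alarm_time, start_alarm_times):
    """Pick the starting point whose observed delay best matches the condition.

    candidates: (start, fault_propagation_time) pairs (hypothetical format).
    alarm_time: occurrence time of the current fault alarm.
    start_alarm_times: alarm occurrence time observed at each starting point.
    """
    def gap(candidate):
        start, expected = candidate
        # Propagation time of the current alarm for this condition.
        observed = alarm_time - start_alarm_times[start]
        return abs(observed - expected)

    return min(candidates, key=gap)[0]


candidates = [("OsNetwork", 30.0), ("Interface", 5.0)]
start_alarm_times = {"OsNetwork": 100.0, "Interface": 90.0}
source = locate_by_propagation_time(candidates, 130.0, start_alarm_times)
```

For `OsNetwork` the observed delay is 30.0 against an expected 30.0, while for `Interface` it is 40.0 against 5.0, so `OsNetwork` is selected.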
  • the network device predicts the object affected by the fault based on the object of the current fault alarm, the updated subgraph of the communication network at the current time, and the fault propagation condition, where the object affected by the fault refers to an object on which a fault alarm occurs because of the currently occurring fault alarm.
  • the network device predicts the object affected by the fault based on the object of the current fault alarm and the fault propagation condition, including:
  • the network device determines the end point of the fourth fault propagation condition as the fault-affected object.
  • the method further includes:
  • the network device predicts the time when the fault-affected object has the fault alarm based on the fault propagation time corresponding to the fourth fault propagation condition and the alarm occurrence time of the currently occurring fault alarm.
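The prediction step can be sketched as follows. The sketch assumes that the fourth fault propagation condition is one whose starting point is the object of the current fault alarm, and that the predicted alarm time is the current alarm time plus the fault propagation time; the triple format and the function name are hypothetical.

```python
def predict_affected_objects(conditions, alarm_object, alarm_time):
    """Return (affected object, predicted alarm time) pairs.

    conditions: (start, end, fault_propagation_time) triples (hypothetical).
    """
    return [
        (end, alarm_time + delay)
        for start, end, delay in conditions
        if start == alarm_object  # condition starts at the alarming object
    ]


conditions = [
    ("OsNetwork", "L3link", 10.0),
    ("OsNetwork", "BGP Peer", 30.0),
    ("L3link", "BGP Peer", 20.0),
]
predicted = predict_affected_objects(conditions, "OsNetwork", 100.0)
```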
  • in a second aspect, a device for extracting fault propagation conditions is provided, and the device has the function of implementing the behavior of the method for extracting fault propagation conditions in the first aspect.
  • the device for extracting fault propagation conditions includes at least one module, and the at least one module is used to implement the method for extracting fault propagation conditions provided in the first aspect described above.
  • in a fourth aspect, a network device is provided. The network device includes a processor and a network interface; the network interface is used to obtain the data involved in implementing any of the methods described in the first aspect, and the processor is used to execute the steps of the method described in the first aspect according to the data obtained by the network interface.
  • a computer-readable storage medium stores instructions that, when run on a computer, cause the computer to execute the method for extracting fault propagation conditions described in the first aspect.
  • the network device can extract the fault propagation conditions from multiple event object connection graphs that correspond one-to-one to multiple different times, without the fault propagation conditions having to be summarized manually.
  • the labor cost is reduced, and the efficiency of extracting the fault propagation conditions can be improved.
  • the faults that occur in the communication network at these different times can cover essentially all types of faults, which ensures that the extracted fault propagation conditions have high fault coverage, are replicable and scalable, and can be widely applied.
  • FIG. 2 is a system architecture diagram for extracting fault propagation conditions according to an embodiment of the present application.
  • FIG. 3 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • FIG. 4 is a flowchart of a method for extracting fault propagation conditions provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram in which the hop count between objects is 1, according to an embodiment of the present application.
  • FIG. 7 is a schematic diagram of an updated subgraph provided by an embodiment of the present application.
  • FIG. 9 is a flowchart of a method for predicting the propagation range of a fault according to an embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of an apparatus for extracting fault propagation conditions provided by an embodiment of the present application.
  • the method provided in the embodiments of the present application can be applied to various communication networks, such as a data center network, a mobile communication network, and so on.
  • the devices in these communication networks can be connected to the network devices, and then the network devices can extract fault propagation conditions that can locate faults in these communication networks. That is, the network device used to extract the fault propagation condition may be a device independent of the communication network.
  • the network device used to extract the fault propagation condition can also be a device in the communication network, that is, the device in the communication network can also extract the fault propagation condition that can locate the fault that occurs in the communication network.
  • FIG. 1 is an architecture diagram of a data center network according to an embodiment of the present application.
  • the data center network includes multiple computer nodes 101, multiple tunnel endpoints 102, and multiple intermediate nodes 103.
  • a communication connection is established between a computer node 101 and a tunnel endpoint 102, and a communication connection is established between each tunnel endpoint 102 and each intermediate node 103.
  • one computer node 101 may also establish a communication connection with two or more tunnel endpoints 102.
  • in this case, the two or more tunnel endpoints 102 can be mutual backup nodes.
  • the multiple computer nodes 101 may be servers, firewalls, load balancers, etc.
  • the servers may be virtual machines or bare machines, that is, machines that do not include an operating system.
  • a network device 104 is also connected to the data center network.
  • the network device 104 may establish a communication connection with each computer node 101, each tunnel endpoint 102, and each intermediate node 103.
  • since communication connections are established among the computer nodes 101, the tunnel endpoints 102, and the intermediate nodes 103, the network device 104 may alternatively establish a communication connection only with the intermediate nodes 103.
  • the establishment of a communication connection between the network device 104 and the intermediate node 103 is taken as an example.
  • by interacting with the connected devices, the network device 104 can obtain the events that occur in the data center network and the connection relationships between the objects related to these events, generate the event object connection graph, and thereby extract the fault propagation conditions.
  • FIG. 3 is a schematic structural diagram of a computer device according to an embodiment of the present application.
  • the computer device may be any device involved in the content described for FIG. 1 and FIG. 2, for example, a computer node 101, a tunnel endpoint 102, an intermediate node 103, or the network device 104.
  • the computer device includes at least one processor 301, a communication bus 302, a memory 303, and at least one communication interface 304.
  • the processor 301 may be a general-purpose central processing unit (CPU), a network processor (NP), or a microprocessor, or may be one or more integrated circuits used to implement the solution of the present application, for example, an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof.
  • the above-mentioned PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL) or any combination thereof.
  • the memory 303 can be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
  • the memory 303 may exist independently and is connected to the processor 301 through the communication bus 302.
  • the memory 303 may also be integrated with the processor 301.
  • the communication interface 304 uses any transceiver-type device to communicate with other devices or a communication network.
  • the communication interface 304 includes a wired communication interface, and may also include a wireless communication interface.
  • the wired communication interface may be, for example, an Ethernet interface.
  • the Ethernet interface can be an optical interface, an electrical interface, or a combination thereof.
  • the wireless communication interface may be a wireless local area network (WLAN) interface, a cellular network communication interface, or a combination thereof.
  • the processor 301 may include one or more CPUs, such as CPU0 and CPU1 as shown in FIG. 3.
  • the computer device may include multiple processors, such as the processor 301 and the processor 305 as shown in FIG. 3.
  • each of these processors can be a single-core processor (single-CPU) or a multi-core processor (multi-CPU).
  • the processor here may refer to one or more devices, circuits, and/or processing cores for processing data (such as computer program instructions).
  • the computer device may further include an output device 306 and an input device 307.
  • the output device 306 communicates with the processor 301 and can display information in a variety of ways.
  • the output device 306 may be a liquid crystal display (LCD), a light emitting diode (LED) display device, a cathode ray tube (CRT) display device, or a projector, etc.
  • the input device 307 communicates with the processor 301, and can receive user input in a variety of ways.
  • the input device 307 may be a mouse, a keyboard, a touch screen device, or a sensor device.
  • FIG. 4 is a flowchart of a method for extracting fault propagation conditions shown in an embodiment of the present application. The method includes the following steps.
  • Step 401: The network device obtains multiple event object connection graphs of the communication network at multiple different times, where the multiple different times correspond one-to-one to the multiple event object connection graphs, and each event object connection graph describes the fault-related events that occur in the communication network and the connection relationships between the objects related to those events.
  • because the event object connection graph of the communication network may change over time, multiple event object connection graphs of the communication network can be obtained at multiple different times.
  • the network device can obtain the fault alarms that occur in the communication network and the logs produced while the communication network is running, extract the fault-related events and the objects related to those events from the logs, and then generate the event object connection graph from the extracted events and the relationships between the objects related to those events.
  • for the specific implementation process, refer to related technologies.
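One possible in-memory form of an event object connection graph is sketched below. The source leaves the generation process to related technologies, so everything here is an assumption made for illustration: the tuple format of the extracted events, the identifiers, and the choice of an undirected adjacency map with alarm times kept in a separate dictionary.

```python
from collections import defaultdict


def build_event_object_graph(events, object_links):
    """Return (adjacency, alarm_time) for one event object connection graph.

    events: (event_id, object_id, alarm_time) tuples (hypothetical format).
    object_links: (object_id, object_id) connections between objects.
    """
    adjacency = defaultdict(set)
    alarm_time = {}
    for event_id, object_id, when in events:
        # Connect each event to the object that generated it.
        adjacency[event_id].add(object_id)
        adjacency[object_id].add(event_id)
        alarm_time[event_id] = when
    for a, b in object_links:
        # Connect related objects to each other.
        adjacency[a].add(b)
        adjacency[b].add(a)
    return dict(adjacency), alarm_time


events = [("ev1", "ospf-net-1", 100.0), ("ev2", "bgp-peer-1", 130.0)]
links = [("ospf-net-1", "l3link-1"), ("l3link-1", "bgp-peer-1")]
graph, times = build_event_object_graph(events, links)
```

Building one such structure per moment or time period yields the multiple graphs that Step 401 refers to.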
  • the aforementioned multiple different times may refer to multiple different moments or multiple different time periods.
  • the aforementioned multiple different times may also include both moments and time periods. That is, the multiple event object connection graphs may all correspond to different moments, may all correspond to different time periods, or may be a mix in which some correspond to different moments and the others correspond to different time periods.
  • the event object connection graph can be represented graphically or in another form, for example, as table entries. The embodiments of the present application do not limit the form in which the event object connection graph is represented.
  • Step 402: The network device determines multiple subgraphs according to the multiple event object connection graphs, where the multiple subgraphs correspond one-to-one to the multiple event object connection graphs, and each subgraph is a subset of the corresponding event object connection graph. In each subgraph, the number of hops between the object that generates the first event and any object related to the first event is not greater than N, the fault-related events include the first event, and N is an integer greater than or equal to 1.
  • the network device may obtain, from the event object connection graph, the connection relationships for which the number of hops between the object that generated the first event and any object related to the first event is less than or equal to N. Since the first event is any fault-related event, once these connection relationships have been obtained for the object that generates each event, the subgraph corresponding to the event object connection graph is obtained.
  • alternatively, the network device may obtain, from the event object connection graph, the connection relationships for which the number of hops between the object that generated the first event and any object related to the first event is equal to N. Since the first event is any fault-related event, once these connection relationships have been obtained for the object that generates each event, the subgraph corresponding to the event object connection graph can be obtained.
  • the path connecting these two objects does not contain an object on which a fault alarm has occurred, that is, there are no fault-related events on the path connecting the two objects, and the two objects are connected through L3link. Therefore, the number of hops between these two objects is equal to 2.
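The "not greater than N hops" rule of Step 402 can be approximated with a breadth-first search. The sketch below is a simplification under stated assumptions: it treats every edge of an adjacency map as one hop and ignores the distinction, made in the text, between paths that do and do not contain objects with fault alarms. The function name and the toy graph are hypothetical.

```python
from collections import deque


def n_hop_subgraph(adjacency, source, max_hops):
    """Return the induced subgraph of nodes within max_hops edges of source."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    kept = set(distance)
    # Keep only edges whose two endpoints are both inside the hop limit.
    return {node: adjacency.get(node, set()) & kept for node in kept}


adjacency = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
sub = n_hop_subgraph(adjacency, "a", 2)
```

With N = 2 and the chain a, b, c, d, node d lies three edges from a and is therefore left out of the subgraph.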
  • Step 403: The network device updates the objects in each of the multiple subgraphs to the corresponding object types according to the correspondence between objects and object types, and obtains multiple updated subgraphs, where the multiple updated subgraphs correspond one-to-one to the multiple subgraphs.
  • the network device may obtain the object type corresponding to each object in a subgraph from the correspondence between objects and object types, and replace the object in the subgraph with the corresponding object type to obtain an updated subgraph.
  • the correspondence between objects and object types can be as shown in Table 1 below.
  • after the objects in a subgraph are updated to the corresponding object types according to Table 1, the updated subgraph shown in FIG. 7 can be obtained.
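The replacement of objects by object types in Step 403 can be sketched as a dictionary lookup over an edge list. Table 1 is not reproduced in this text, so the mapping and the object identifiers below are hypothetical; only the type names OsNetwork, L3link, and BGP Peer appear in the source.

```python
def to_updated_subgraph(edges, object_type):
    """Replace each object in an edge list with its object type.

    Nodes without an entry in object_type (for example events) are kept as is.
    """
    return [(object_type.get(a, a), object_type.get(b, b)) for a, b in edges]


# Hypothetical object-to-type correspondence and subgraph edges.
object_type = {
    "10.0.0.0/24": "OsNetwork",
    "link-7": "L3link",
    "peer-3": "BGP Peer",
}
subgraph = [
    ("alarm-1", "10.0.0.0/24"),
    ("10.0.0.0/24", "link-7"),
    ("link-7", "peer-3"),
]
updated = to_updated_subgraph(subgraph, object_type)
```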
  • Step 404 The network device determines a fault propagation condition according to a plurality of updated subgraphs, and the fault propagation condition is used to indicate the path through which the fault is propagated in the communication network.
  • the network device may convert each of the multiple updated subgraphs into a graph embedding vector according to a graph embedding algorithm, obtaining multiple graph embedding vectors that correspond one-to-one to the multiple updated subgraphs.
  • multiple subgraph sets are determined according to the multiple graph embedding vectors and a clustering algorithm, and each of the multiple subgraph sets includes at least one of the multiple updated subgraphs.
  • according to a frequent subgraph mining algorithm, the fault propagation conditions are extracted from the updated subgraphs included in each of the multiple subgraph sets.
  • the implementation process for the network device to determine the multiple subgraph sets according to the multiple graph embedding vectors and the clustering algorithm may be: determining the similarity between every two of the multiple graph embedding vectors.
  • the multiple updated subgraphs are clustered according to the determined similarities and the clustering algorithm to obtain the multiple subgraph sets.
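The similarity and clustering step can be illustrated as follows. The source names neither the similarity measure nor the clustering algorithm, so this sketch swaps in cosine similarity and a simple threshold-linkage grouping (two vectors land in the same set whenever a chain of pairs with similarity at or above the threshold links them). The embedding vectors are assumed to be given, and all names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_by_similarity(vectors, threshold):
    """Group indices whose embeddings are linked by similarity >= threshold."""
    parent = list(range(len(vectors)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical graph embedding vectors of three updated subgraphs.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
subgraph_sets = cluster_by_similarity(vectors, threshold=0.9)
```

Each returned group lists the indices of the updated subgraphs that form one subgraph set.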
  • fault propagation conditions can be expressed in the form of text or graphics.
  • this fault propagation condition indicates that a neighbor protocol status failure in the OSPF network segment (OsNetwork) makes the BGP Loopback port IP unreachable (L3link), which ultimately breaks the link between BGP neighbors (BGP Peer).
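A full frequent subgraph mining algorithm is beyond a short example, but its first level, counting how many updated subgraphs support each typed edge, can be sketched as below. This is explicitly a simplification and not the mining algorithm used in the application: it only finds frequent single edges, and the function name and the `min_support` parameter are hypothetical.

```python
from collections import Counter


def frequent_typed_edges(updated_subgraphs, min_support):
    """Return typed edges present in at least min_support of the subgraphs.

    updated_subgraphs: list of edge lists, each edge a pair of distinct types.
    min_support: required fraction of subgraphs, between 0 and 1.
    """
    counts = Counter()
    for edges in updated_subgraphs:
        # Count each undirected edge at most once per subgraph.
        counts.update({frozenset(edge) for edge in edges})
    needed = min_support * len(updated_subgraphs)
    return {tuple(sorted(edge)) for edge, n in counts.items() if n >= needed}


subgraph_set = [
    [("OsNetwork", "L3link"), ("L3link", "BGP Peer")],
    [("L3link", "OsNetwork"), ("L3link", "BGP Peer")],
    [("L3link", "BGP Peer")],
]
frequent = frequent_typed_edges(subgraph_set, min_support=0.6)
```

With a support of 0.6, both edges of the OsNetwork, L3link, BGP Peer chain are frequent; a real miner would then grow such edges into larger frequent patterns.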
  • the probability of occurrence of the extracted fault propagation conditions and/or the corresponding fault propagation time can also be determined. That is, the network device can determine the probability of occurrence of the extracted fault propagation conditions, the fault propagation time corresponding to them, or both.
  • the implementation process for the network device to determine the fault propagation time corresponding to an extracted fault propagation condition may be as follows: the network device determines the alarm occurrence time of the starting point and the alarm occurrence time of the end point of a first fault propagation condition, where the first fault propagation condition is a fault propagation condition extracted from a first subgraph set, and the multiple subgraph sets include the first subgraph set.
  • the network device determines the difference between the alarm occurrence time at the starting point of the first fault propagation condition and the alarm occurrence time at the end point as the fault propagation time corresponding to the first fault propagation condition.
  • the event object connection graph includes fault-related events, each fault-related event generates a fault alarm, and a fault alarm usually carries its alarm occurrence time.
  • therefore, the fault-related events can carry the alarm occurrence time, and so can the updated subgraphs.
  • the process by which the network device determines the alarm occurrence time of the starting point and the alarm occurrence time of the end point of the first fault propagation condition may be as follows: determine, from the first subgraph set, the updated subgraphs in which the first fault propagation condition appears; obtain, from the determined updated subgraphs, the alarm occurrence times carried by the events connected to the starting point of the first fault propagation condition and the alarm occurrence times carried by the events connected to the end point; determine the average of the alarm occurrence times carried by the events connected to the starting point as the alarm occurrence time of the starting point of the first fault propagation condition; and determine the average of the alarm occurrence times carried by the events connected to the end point as the alarm occurrence time of the end point of the first fault propagation condition.
  • the network device may also determine the update subgraph in which the first fault propagation condition occurs from the first subgraph set, and obtain the alarm occurrence time carried by the event connected to the starting point of the first fault propagation condition from the determined update subgraph, And the alarm occurrence time carried by the event connected to the end point, determine the difference between the alarm occurrence time carried by the event connected to the starting point of the acquired first fault propagation condition and the alarm occurrence time carried by the event connected to the end point, it will be determined The average of these differences is determined as the fault propagation time corresponding to the first fault propagation condition.
  • each subgraph set can be determined according to the above method The fault propagation time corresponding to each fault propagation condition extracted in.
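The averaging of per-subgraph differences described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the function name `fault_propagation_time`, the representation of each occurrence as a `(start_alarm_time, end_alarm_time)` pair in seconds, and the sign convention (end minus start) are all hypothetical.

```python
from statistics import mean


def fault_propagation_time(occurrences):
    """Average end-minus-start alarm time over all occurrences of one condition.

    occurrences: list of (start_alarm_time, end_alarm_time) pairs, one pair per
    updated subgraph in which the condition appears (hypothetical layout).
    """
    if not occurrences:
        raise ValueError("the condition does not appear in any updated subgraph")
    # One difference per updated subgraph, then the average of those differences.
    return mean(end - start for start, end in occurrences)


occurrences = [(100.0, 102.0), (250.0, 253.0), (400.0, 401.0)]
print(fault_propagation_time(occurrences))  # 2.0
```

When every updated subgraph contributes exactly one start time and one end time, averaging the differences gives the same value as subtracting the averaged start time from the averaged end time, so the two variants described above agree in that case.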
  • The network device may determine the probability of occurrence of a fault propagation condition as follows: the network device determines the number of updated subgraphs in the first subgraph set in which the first fault propagation condition appears, where the first fault propagation condition is a fault propagation condition extracted from the first subgraph set, and the multiple subgraph sets include the first subgraph set. The network device then determines the probability of occurrence of the first fault propagation condition according to the ratio of that number to the total number of updated subgraphs in the first subgraph set.
  • Because the first fault propagation condition is a fault propagation condition extracted from the first subgraph set, the probability of each fault propagation condition extracted from each subgraph set can be determined according to the above method.
  • the network device may directly determine the ratio between the determined number and the total number of updated subgraphs in the first subgraph set as the probability of the occurrence of the first fault propagation condition.
  • For example, the network device extracts fault propagation condition 1 from the first subgraph set, fault propagation condition 1 appears in 20 updated subgraphs of the first subgraph set, and the first subgraph set contains 30 updated subgraphs in total, so the probability of occurrence of fault propagation condition 1 is 20/30, or about 67%.
  • Alternatively, the network device may determine the number of times the first fault propagation condition appears in the multiple updated subgraphs to obtain a first number, where the extracted fault propagation conditions include the first fault propagation condition. It then determines the number of times the connection relationship between the start point of the first fault propagation condition and a second event appears in the multiple updated subgraphs to obtain a second number, where the fault-related events include the second event, and the second event is the event corresponding to the first fault propagation condition. The probability of occurrence of the first fault propagation condition is determined according to the ratio of the first number to the second number.
  • The network device may directly determine the ratio of the first number to the second number as the probability of occurrence of the first fault propagation condition.
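Both ways of computing the probability can be sketched as follows. The data layout is an assumption made for illustration only: each updated subgraph is a dict with an `edges` set of `(type, type)` pairs and an `events` set of `(type, event)` pairs, a condition is a tuple of object types along the propagation path, and each subgraph is counted at most once. The names `path_edges`, `probability_in_set`, and `probability_given_event` are hypothetical.

```python
def path_edges(condition):
    """Directed edges along a condition given as a tuple of object types."""
    return set(zip(condition, condition[1:]))


def probability_in_set(subgraph_set, condition):
    """Share of updated subgraphs in one subgraph set that contain the path."""
    if not subgraph_set:
        raise ValueError("empty subgraph set")
    needed = path_edges(condition)
    hits = sum(1 for g in subgraph_set if needed <= g["edges"])
    return hits / len(subgraph_set)


def probability_given_event(subgraphs, condition, event):
    """Ratio of the first number to the second number."""
    needed = path_edges(condition)
    # First number: subgraphs in which the whole propagation path appears.
    first = sum(1 for g in subgraphs if needed <= g["edges"])
    # Second number: subgraphs in which the start point is connected to the event.
    second = sum(1 for g in subgraphs if (condition[0], event) in g["events"])
    return first / second if second else 0.0


cond = ("Port", "Card", "Router")
with_path = {"edges": {("Port", "Card"), ("Card", "Router")},
             "events": {("Port", "link_down")}}
event_only = {"edges": {("Port", "Card")}, "events": {("Port", "link_down")}}
subgraphs = [with_path] * 20 + [event_only] * 10

print(round(probability_in_set(subgraphs, cond), 2))  # 0.67
```

The sample data reproduces the 20-out-of-30 example above. Returning 0.0 when the second number is zero is a design choice of this sketch, not something the text specifies.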
  • The network device can extract fault propagation conditions from multiple event object connection graphs that correspond one-to-one to multiple different times, without manual summarization of the fault propagation conditions, which reduces labor costs and improves the efficiency of extracting fault propagation conditions.
  • Moreover, the faults that occur in the communication network at these different times can cover essentially all fault types, which ensures that the extracted fault propagation conditions have high fault coverage, are replicable and scalable, and can be widely deployed.
  • The method for determining the fault source provided by the embodiments of this application can be implemented on the basis of the embodiment shown in FIG. 4. That is, after the network device extracts the fault propagation conditions according to the embodiment shown in FIG. 4 and determines the probability of occurrence of each fault propagation condition and the corresponding fault propagation time, it can determine the fault source according to the following steps 801-803.
  • Step 801 The network device selects the fault propagation conditions that meet the conditions from the extracted fault propagation conditions according to the object of the current fault alarm, the updated sub-graph of the communication network at the current time, and the fault propagation time corresponding to the fault propagation condition.
  • the network device may determine the fault propagation conditions that meet the conditions according to the following steps (1)-(4).
  • The network device may select, from the extracted fault propagation conditions, those whose end point is the object of the current fault alarm. From the selected conditions, it then keeps those whose indicated path exists in the updated subgraph of the communication network at the current time, and uses them as the second fault propagation conditions, that is, the conditions whose end point is the object of the current fault alarm and that can be matched with the updated subgraph of the communication network at the current time.
  • The communication network may correspond to different event object connection graphs at different times. Therefore, the network device can determine the updated subgraph of the communication network at the current time from the event object connection graph of the communication network at the current time. That is, when determining the fault source, the network device can determine the event object connection graph of the communication network at the current time, determine the subgraph at the current time from that graph, and, according to the correspondence between objects and object types, update the objects in that subgraph to the corresponding object types to obtain the updated subgraph of the communication network at the current time.
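The update from objects to object types can be sketched as a simple relabeling of edge endpoints. The function name `to_updated_subgraph`, the edge-set representation, and the sample object names are hypothetical; leaving nodes without a known type (here, the event node) unchanged is also an assumption of this sketch.

```python
def to_updated_subgraph(edges, object_type):
    """Replace each object in an edge list by its object type.

    edges: iterable of (node, node) pairs from the current-time subgraph.
    object_type: dict mapping an object name to its type; nodes that are not
    in the dict (for example event nodes) are kept as they are.
    """
    return {(object_type.get(a, a), object_type.get(b, b)) for a, b in edges}


edges = [("link_down", "port-1/0/1"), ("port-1/0/1", "card-3"), ("card-3", "router-A")]
types = {"port-1/0/1": "Port", "card-3": "Card", "router-A": "Router"}
updated = to_updated_subgraph(edges, types)
print(sorted(updated))
```

Because the result is a set of type-level edges, two subgraphs that differ only in the concrete objects involved map to the same updated subgraph, which is what makes conditions learned at one place reusable elsewhere in the network.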
  • a third fault propagation condition whose starting point has a fault alarm before the current time is selected from the second fault propagation conditions.
  • The propagation time of the current alarm is the difference between the alarm occurrence time of the start point of the third fault propagation condition and the alarm occurrence time of the current fault alarm, and the alarm occurrence time of the start point of the third fault propagation condition is determined from the updated subgraph of the communication network at the current time.
  • When the difference between the propagation time of the current alarm corresponding to a third fault propagation condition and its fault propagation time is less than a time threshold, the selected fault propagation condition can be regarded as a fault propagation condition that meets the condition.
  • The time threshold can be set according to usage requirements, for example, to 2 seconds, which is not limited in the embodiments of this application.
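The screening in step 801 can be sketched as a single filter. Everything below is illustrative: the `Condition` dataclass, the function name `filter_conditions`, and the argument layout are hypothetical, the current alarm object and the keys of `start_alarm_times` are assumed to be expressed as object types already, and the use of an absolute difference against the threshold is an assumption, since the text only says the difference must be less than the threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    path: tuple              # object types from start point to end point
    propagation_time: float  # learned fault propagation time, in seconds
    probability: float       # learned probability of occurrence


def filter_conditions(conditions, alarm_object, alarm_time,
                      graph_edges, start_alarm_times, threshold=2.0):
    """Return the conditions that meet the condition for the current alarm."""
    selected = []
    for c in conditions:
        # Second conditions: end point is the alarming object and the whole
        # path exists in the current updated subgraph.
        if c.path[-1] != alarm_object:
            continue
        if not set(zip(c.path, c.path[1:])) <= graph_edges:
            continue
        # Third conditions: the start point raised an alarm before now.
        start_time = start_alarm_times.get(c.path[0])
        if start_time is None or start_time >= alarm_time:
            continue
        # Propagation time of the current alarm, compared with the learned one.
        current = alarm_time - start_time
        if abs(current - c.propagation_time) < threshold:
            selected.append(c)
    return selected


a = Condition(("Port", "Card", "Router"), 3.0, 0.67)
b = Condition(("Card", "Router"), 10.0, 0.20)
c = Condition(("Link", "Switch"), 1.0, 0.90)
graph = {("Port", "Card"), ("Card", "Router")}
alarms = {"Port": 97.5, "Card": 99.0}
print(filter_conditions([a, b, c], "Router", 100.0, graph, alarms))
```

In this sample only the first condition survives: the second is rejected because its observed propagation time (1.0 s) is far from the learned 10.0 s, and the third because its end point is not the alarming object.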
  • Step 802 When the number of fault propagation conditions that meet the conditions is 1, the network device determines the starting point of the fault propagation conditions that meet the conditions as the fault source of the current fault alarm.
  • Step 803 When the number of the fault propagation conditions that meet the condition is greater than 1, the network device determines the starting point of the fault propagation condition with the highest probability among the fault propagation conditions that meet the condition as the fault source of the current fault alarm.
  • Because each fault propagation condition corresponds to a probability and there is usually only one fault source, when the number of fault propagation conditions that meet the condition is greater than 1, the fault propagation condition with the highest probability can be selected from them, and its start point is determined as the fault source of the current fault alarm.
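Steps 802 and 803 reduce to picking the start point of the most probable remaining condition. A minimal sketch, assuming each candidate is a hypothetical `(start_point, probability)` pair:

```python
def pick_fault_source(candidates):
    """candidates: list of (start_point, probability) pairs for the
    fault propagation conditions that meet the condition."""
    if not candidates:
        return None  # no condition matched; steps 802-803 do not cover this case
    # With one candidate this returns its start point (step 802); with several
    # it returns the start point of the most probable one (step 803).
    start_point, _ = max(candidates, key=lambda c: c[1])
    return start_point


print(pick_fault_source([("Port", 0.67), ("Card", 0.20)]))  # Port
```

If two candidates share the highest probability, `max` returns the first one in the list; how ties should be broken is not specified in the text.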
  • Alternatively, the fault propagation conditions that meet the condition may first be selected by probability: a fault propagation condition whose probability is greater than a probability threshold is selected from the third fault propagation conditions, and the selected fault propagation condition is used as a fault propagation condition that meets the condition.
  • When the number of fault propagation conditions that meet the condition is 1, the network device determines the start point of that fault propagation condition as the fault source of the current fault alarm.
  • When the number of fault propagation conditions that meet the condition is greater than 1, the network device determines, according to the updated subgraph of the communication network at the current time, the propagation time of the current alarm corresponding to each fault propagation condition that meets the condition, and determines the start point of the fault propagation condition with the smallest difference between its propagation time of the current alarm and its fault propagation time as the fault source of the currently occurring fault alarm.
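The probability-first ordering can be sketched as a threshold filter followed by a minimum over time differences. The dict keys (`start`, `probability`, `propagation_time`, `current_alarm_time`) and the function name are hypothetical, and the candidates are assumed to be third fault propagation conditions already.

```python
def pick_by_probability_then_time(candidates, probability_threshold):
    """Filter by probability first, then break ties by propagation time.

    candidates: list of dicts with keys 'start', 'probability',
    'propagation_time' (learned) and 'current_alarm_time' (the propagation
    time of the current alarm), all hypothetical names.
    """
    kept = [c for c in candidates if c["probability"] > probability_threshold]
    if not kept:
        return None
    # With one survivor this returns its start point; with several it returns
    # the one whose observed propagation time is closest to the learned one.
    best = min(kept, key=lambda c: abs(c["current_alarm_time"] - c["propagation_time"]))
    return best["start"]


candidates = [
    {"start": "Port", "probability": 0.67, "propagation_time": 3.0, "current_alarm_time": 2.5},
    {"start": "Card", "probability": 0.55, "propagation_time": 1.0, "current_alarm_time": 2.8},
    {"start": "Fan", "probability": 0.10, "propagation_time": 2.0, "current_alarm_time": 2.0},
]
print(pick_by_probability_then_time(candidates, 0.5))  # Port
```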
  • the network device may also only determine the fault propagation time corresponding to the fault propagation condition, or only determine the probability of occurrence of the fault propagation condition. In this case, the network device may only determine the fault source of the current fault alarm based on the propagation time of the fault, or only determine the fault source of the currently occurring fault alarm based on the probability.
  • The network device may determine the fault source of the current fault alarm based only on the fault propagation time as follows: from the extracted fault propagation conditions, select the second fault propagation conditions, whose end point is the object of the current fault alarm and which can be matched with the updated subgraph of the communication network at the current time.
  • a third fault propagation condition whose starting point has a fault alarm occurred before the current time is selected from the second fault propagation conditions.
  • the propagation time of this alarm corresponding to the third fault propagation condition is determined.
  • The network device may determine the fault source of the current fault alarm based only on the probability as follows: from the extracted fault propagation conditions, select the second fault propagation conditions, whose end point is the object of the current fault alarm and which can be matched with the updated subgraph of the communication network at the current time.
  • a third fault propagation condition whose starting point has a fault alarm occurred before the current time is selected from the second fault propagation conditions.
  • the propagation time of this alarm corresponding to the third fault propagation condition is determined.
  • a fault propagation condition with a probability greater than a probability threshold is selected from the third fault propagation conditions, and the selected fault propagation condition is taken as a fault propagation condition that satisfies the condition.
  • the network device determines the starting point of the fault propagation conditions that meet the conditions as the fault source of the current fault alarm.
  • the network device determines the starting point of the fault propagation condition with the highest probability among the fault propagation conditions that meet the condition as the fault source of the current fault alarm.
  • Because the fault propagation conditions are extracted from multiple event object connection graphs that correspond one-to-one to multiple different times, the accuracy of the extracted fault propagation conditions can be ensured, and so can the accuracy of the fault source determined from them. Moreover, because the fault coverage of the extracted fault propagation conditions is relatively high, the probability that the fault source can be determined from them is also relatively high.
  • FIG. 9 is a flowchart of a method for predicting the propagation range of a fault according to an embodiment of this application. The method predicts the fault-affected objects based on the object of the current fault alarm, the updated subgraph of the communication network at the current time, and the extracted fault propagation conditions. A fault-affected object is an object on which a fault alarm occurs under the influence of the currently occurring fault alarm. The method includes the following steps.
  • Step 901 From the extracted fault propagation conditions, the network device selects the fourth fault propagation condition whose starting point is the object of the current fault alarm and can match the updated subgraph of the communication network at the current time.
  • The communication network may correspond to different event object connection graphs at different times. Therefore, the network device can determine the updated subgraph of the communication network at the current time from the event object connection graph of the communication network at the current time. That is, when determining the fault source, the network device can determine the event object connection graph of the communication network at the current time, determine the subgraph at the current time from that graph, and, according to the correspondence between objects and object types, update the objects in that subgraph to the corresponding object types to obtain the updated subgraph of the communication network at the current time.
  • Step 902 The network device determines the end point of the fourth fault propagation condition as the fault-affected object.
  • Step 903 The network device predicts the time when the fault-affected object has the fault alarm based on the fault propagation time corresponding to the fourth fault propagation condition and the alarm occurrence time of the currently occurring fault alarm.
  • Because the fault propagation time is the difference between the alarm occurrence time of the start point of a fault propagation condition and the alarm occurrence time of its end point, once the fault-affected object has been predicted, the time at which the fault alarm will occur on it can be predicted from the fault propagation time corresponding to the fourth fault propagation condition and the alarm occurrence time of the current fault alarm.
  • the sum of the alarm occurrence time of the currently occurring fault alarm and the fault propagation time corresponding to the fourth fault propagation condition may be determined as the time when the fault-affected object has the fault alarm.
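Steps 901-903 can be sketched together. The function name `predict_affected`, the `(path, propagation_time)` tuple layout, and the sample values are hypothetical; keeping the earliest predicted time when several conditions end at the same object is an assumption of this sketch, not something the text specifies.

```python
def predict_affected(conditions, alarm_object, alarm_time, graph_edges):
    """Map each fault-affected object to the predicted time of its alarm.

    conditions: list of (path, propagation_time) pairs, where path is a tuple
    of object types from start point to end point.
    """
    predictions = {}
    for path, propagation_time in conditions:
        # Step 901: start point is the alarming object and the path matches
        # the current updated subgraph (the fourth fault propagation conditions).
        if path[0] != alarm_object:
            continue
        if not set(zip(path, path[1:])) <= graph_edges:
            continue
        # Step 902: the end point is the fault-affected object.
        end_point = path[-1]
        # Step 903: predicted time = current alarm time + fault propagation time.
        predicted = alarm_time + propagation_time
        if end_point not in predictions or predicted < predictions[end_point]:
            predictions[end_point] = predicted
    return predictions


conditions = [
    (("Port", "Card"), 1.5),
    (("Port", "Card", "Router"), 3.0),
    (("Card", "Router"), 2.0),   # start point is not the alarming object
    (("Port", "Fan"), 1.0),      # path is absent from the current subgraph
]
graph = {("Port", "Card"), ("Card", "Router")}
print(predict_affected(conditions, "Port", 100.0, graph))
```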
  • FIG. 10 is a schematic structural diagram of a device for extracting fault propagation conditions shown in an embodiment of the present application.
  • the device can be implemented as part or all of a network device by software, hardware, or a combination of the two. It can be the network device described in part of Figure 1.
  • the device includes: an acquisition module 1001, a first determination module 1002, an update module 1003, and a second determination module 1004.
  • the obtaining module 1001 is configured to perform the operation of step 401 in the embodiment shown in FIG. 4;
  • the first determining module 1002 is configured to perform the operation of step 402 in the embodiment shown in FIG. 4;
  • the update module 1003 is configured to perform the operation of step 403 in the embodiment shown in FIG. 4;
  • the second determining module 1004 includes:
  • the first determining sub-module is configured to determine multiple sub-graph sets according to multiple graph embedding vectors and clustering algorithms, and each sub-graph set in the multiple sub-graph sets includes at least one updated sub-graph of the multiple updated sub-graphs;
  • the first determining submodule is used for:
  • the multiple updated sub-graphs are clustered to obtain multiple sub-graph sets.
  • the second determining module 1004 includes:
  • the second extraction sub-module is used to extract fault propagation conditions from multiple updated sub-graphs according to the frequent sub-graph mining algorithm.
  • the device further includes:
  • the third determining module is used to determine the fault propagation time corresponding to the fault propagation condition
  • the screening module is used to filter the fault propagation conditions that meet the conditions from the fault propagation conditions according to the current fault alarm object, the current time communication network update sub-graph, and the fault propagation time corresponding to the fault propagation condition;
  • the fourth determination module is used to determine the starting point of the fault propagation condition that meets the condition as the fault source of the current fault alarm when the number of the fault propagation conditions that meet the condition is one.
  • the third determining module includes:
  • the second determining sub-module is used to determine the alarm occurrence time of the start point and the alarm occurrence time of the end point of the first fault propagation condition, where the first fault propagation condition is a fault propagation condition extracted from the first subgraph set, and the multiple subgraph sets include the first subgraph set;
  • the third determining sub-module is used to determine the difference between the alarm occurrence time at the starting point of the first fault propagation condition and the alarm occurrence time at the end point as the fault propagation time corresponding to the first fault propagation condition.
  • the screening module includes:
  • the first selection sub-module is used to select, from the fault propagation conditions, the second fault propagation condition whose destination is the object of the current fault alarm and can match the update subgraph of the communication network at the current time;
  • the fourth determining sub-module is used to determine, according to the updated subgraph of the communication network at the current time, the propagation time of the current alarm corresponding to the third fault propagation condition, where the propagation time of the current alarm is the difference between the alarm occurrence time of the start point of the third fault propagation condition and the alarm occurrence time of the current fault alarm, and the alarm occurrence time of the start point of the third fault propagation condition is determined from the updated subgraph of the communication network at the current time;
  • the third selection sub-module is used to select, from the third fault propagation conditions, the fault propagation conditions for which the difference between the corresponding propagation time of the current alarm and the fault propagation time is less than the time threshold, and to use the selected fault propagation conditions as the fault propagation conditions that meet the condition.
  • the device further includes:
  • the fifth determining module is used to determine the probability of occurrence of the fault propagation condition
  • the sixth determining module is used to determine the starting point of the fault propagation condition with the highest probability of the fault propagation conditions satisfying the condition as the fault source of the currently occurring fault alarm when the number of the fault propagation conditions satisfying the condition is greater than one.
  • the fault propagation condition is extracted from the updated subgraph included in each subgraph set in the multiple subgraph sets according to a frequent subgraph mining algorithm
  • the fifth determining module includes:
  • the fifth determining sub-module is used to determine the number of updated subgraphs in the first subgraph set in which the first fault propagation condition appears, where the first fault propagation condition is a fault propagation condition extracted from the first subgraph set, and the multiple subgraph sets include the first subgraph set;
  • the sixth determining sub-module is used to determine the probability of the occurrence of the first fault propagation condition according to the ratio between the number and the total number of updated sub-graphs in the first sub-graph set.
  • the fifth determining module includes:
  • the seventh determining sub-module is used to determine the number of times the first fault propagation condition appears in the multiple update subgraphs to obtain the first number, and the extracted fault propagation condition includes the first fault propagation condition;
  • the eighth determining sub-module is used to determine the number of times the connection relationship between the start point of the first fault propagation condition and the second event appears in the multiple updated subgraphs to obtain the second number, where the fault-related events include the second event, and the second event is the event corresponding to the first fault propagation condition;
  • the ninth determining sub-module is configured to determine the probability of occurrence of the first fault propagation condition according to the ratio of the first number to the second number.
  • the device further includes:
  • the first prediction module is used to predict the fault-affected objects based on the object of the current fault alarm, the updated subgraph of the communication network at the current time, and the fault propagation conditions, where a fault-affected object is an object on which a fault alarm occurs under the influence of the currently occurring fault alarm.
  • the first prediction module includes:
  • the fourth selection sub-module is used to select, from the fault propagation conditions, the fourth fault propagation condition, whose start point is the object of the current fault alarm and which can be matched with the updated subgraph of the communication network at the current time;
  • the seventh determining sub-module is used to determine the end point of the fourth fault propagation condition as the fault-affected object.
  • the device further includes:
  • the seventh determining module is used to determine the fault propagation time corresponding to the fault propagation condition
  • the second prediction module is used to predict the time when the fault-affected object has a fault alarm based on the fault propagation time corresponding to the fourth fault propagation condition and the alarm occurrence time of the currently occurring fault alarm.
  • The fault propagation conditions can be extracted from multiple event object connection graphs that correspond one-to-one to multiple different times, without manual summarization of the fault propagation conditions, which reduces labor costs and improves the efficiency of extracting fault propagation conditions.
  • Moreover, the faults that occur in the communication network at these different times can cover essentially all fault types, which ensures that the extracted fault propagation conditions have high fault coverage, are replicable and scalable, and can be widely deployed.
  • When the device for extracting fault propagation conditions extracts fault propagation conditions, the division into the functional modules described above is used only as an example. In practice, the functions can be allocated to different functional modules as needed; that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above.
  • the device for extracting a fault propagation condition provided by the foregoing embodiment belongs to the same concept as the embodiment of the method for extracting a fault propagation condition. For the specific implementation process, please refer to the method embodiment, which will not be repeated here.
  • The foregoing embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof.
  • When implemented in software, they can be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium.
  • the computer instructions may be transmitted from a website, computer, server, or data center.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center integrated with one or more available media.
  • The usable medium may be a magnetic medium (for example, a floppy disk, hard disk, or tape), an optical medium (for example, a digital versatile disc (DVD)), a semiconductor medium (for example, a solid state disk (SSD)), or the like.
  • the computer-readable storage medium mentioned in this application may be a non-volatile storage medium, in other words, it may be a non-transitory storage medium.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

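This application discloses a method, an apparatus, and a storage medium for extracting fault propagation conditions, belonging to the field of communication technologies and further relating to the application of AI in that field. The method includes: a network device obtains, at multiple different times, multiple event object connection graphs corresponding to a communication network; determines multiple subgraphs according to the multiple event object connection graphs; updates, according to the correspondence between objects and object types, the objects in each of the multiple subgraphs to the corresponding object types to obtain multiple updated subgraphs; and determines fault propagation conditions according to the multiple updated subgraphs, where a fault propagation condition indicates the path along which a fault is propagated in the communication network. This application requires no manual summarization of fault propagation conditions, which reduces labor costs and in turn improves the efficiency of extracting fault propagation conditions. Moreover, the extracted fault propagation conditions have high fault coverage, are replicable and scalable, and can be widely deployed.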

Description

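Method, apparatus, and storage medium for extracting fault propagation conditions
This application claims priority to Chinese Patent Application No. 201910877916.8, filed with the China National Intellectual Property Administration on September 17, 2019 and entitled "Method, apparatus, and storage medium for extracting fault propagation conditions", which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of communication technologies, further relates to the application of artificial intelligence (AI) in the field of communication technologies, and in particular relates to a method, an apparatus, and a storage medium for extracting fault propagation conditions.
Background
As communication network systems grow more complex, the operation and maintenance cost of locating faults in communication networks keeps rising. For example, in a data center network, the causes of faults such as device restarts and router identity (ID) conflicts are very complex, and the operation and maintenance cost of locating these faults keeps increasing. To reduce this cost, fault propagation conditions usually need to be extracted, and faults are located by using the fault propagation conditions.
In the related art, fault propagation conditions, which may also be called fault judgment rules, are usually summarized manually, and faults are then located according to the manually summarized conditions. In practice, however, people can usually summarize fault propagation conditions only for certain types of faults, so fault coverage is poor. The work is also time-consuming and labor-intensive, is neither replicable nor scalable, and cannot be widely deployed.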
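Summary
This application provides a method, an apparatus, and a storage medium for extracting fault propagation conditions, which can solve the problems in the related art of poor fault coverage, high time and labor costs, lack of replicability and scalability, and inability to be widely deployed. The technical solutions are as follows:
According to a first aspect, a method for extracting fault propagation conditions is provided. The method includes:
A network device obtains, at multiple different times, multiple event object connection graphs corresponding to a communication network, where the multiple different times are in one-to-one correspondence with the multiple event object connection graphs, and each of the event object connection graphs describes the connection relationships among the fault-related events that occur in the communication network and the objects related to those events;
The network device determines multiple subgraphs according to the multiple event object connection graphs, where the multiple subgraphs are in one-to-one correspondence with the multiple event object connection graphs, each subgraph is a subset of the corresponding event object connection graph, and in each subgraph the number of hops between the object that generates a first event and any object related to the first event is not greater than N, where the events include the first event and N is an integer greater than or equal to 1;
The network device updates, according to the correspondence between objects and object types, the objects in each of the multiple subgraphs to the corresponding object types to obtain multiple updated subgraphs, where the multiple updated subgraphs are in one-to-one correspondence with the multiple subgraphs;
The network device determines fault propagation conditions according to the multiple updated subgraphs, where a fault propagation condition indicates the path along which a fault is propagated in the communication network.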
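The N-hop constraint on each subgraph can be sketched as a breadth-first search from the object that generates the first event. The adjacency-dict representation, the function name `n_hop_subgraph`, and the sample topology are assumptions made for illustration; the graph is treated as undirected here.

```python
from collections import deque


def n_hop_subgraph(adjacency, source, n):
    """Return (nodes, edges) of the part of the graph within n hops of source.

    adjacency: dict mapping each node to an iterable of neighbouring nodes
    (hypothetical representation of an event object connection graph).
    """
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if distance[node] == n:
            continue  # do not expand beyond n hops
        for neighbour in adjacency.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    nodes = set(distance)
    # Keep only edges whose two endpoints both survived; store each once.
    edges = {tuple(sorted((a, b)))
             for a in nodes for b in adjacency.get(a, ()) if b in nodes}
    return nodes, edges


adjacency = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
nodes, edges = n_hop_subgraph(adjacency, "A", 2)
print(sorted(nodes), sorted(edges))
```

With the chain A-B-C-D and N = 2, node D is three hops from A and is therefore left out of the subgraph.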
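It should be noted that the multiple different times may be multiple different moments or multiple different time periods, or may include both moments and time periods. That is, the multiple event object connection graphs may all correspond to different moments, may all correspond to different time periods, or some may correspond to different moments and the others to different time periods.
In addition, in the embodiments of this application, an event object connection graph may be represented graphically or in another form, for example, as table entries. The embodiments of this application do not limit the representation of the event object connection graph.
It is worth noting that the number of fault propagation conditions that the network device extracts, according to the frequent subgraph mining algorithm, from the updated subgraphs included in one subgraph set may be 0, 1, or greater than 1. Moreover, no fault propagation condition may be extractable from some updated subgraphs, one or more fault propagation conditions may be extracted from others, and the same fault propagation condition may be extracted from two or more updated subgraphs.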
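Optionally, that the network device determines fault propagation conditions according to the multiple updated subgraphs includes:
The network device converts, according to a graph embedding algorithm, each of the multiple updated subgraphs into a graph embedding vector, to obtain multiple graph embedding vectors in one-to-one correspondence with the multiple updated subgraphs;
The network device determines multiple subgraph sets according to the multiple graph embedding vectors and a clustering algorithm, where each of the subgraph sets includes at least one of the multiple updated subgraphs;
The network device extracts the fault propagation conditions, according to a frequent subgraph mining algorithm, from the updated subgraphs included in each of the multiple subgraph sets.
Optionally, that the network device determines multiple subgraph sets according to the multiple graph embedding vectors and a clustering algorithm includes:
The network device determines the similarity between every two of the multiple graph embedding vectors;
The network device clusters the multiple updated subgraphs according to the similarity and the clustering algorithm, to obtain the multiple subgraph sets.
Because a graph embedding vector can represent an updated subgraph, the network device can cluster the multiple updated subgraphs with the clustering algorithm, according to the similarity between every two of the graph embedding vectors, to obtain the multiple subgraph sets.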
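Clustering by pairwise similarity can be sketched as follows. The passage above does not name a similarity measure or a clustering algorithm, so cosine similarity and threshold-based grouping with a union-find structure are stand-ins chosen for this illustration, and the embedding vectors are made-up values rather than the output of a real graph embedding algorithm. Vectors are assumed to be non-zero.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def cluster_by_similarity(vectors, threshold):
    """Group vector indices whose pairwise similarity chains reach threshold."""
    parent = list(range(len(vectors)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


embeddings = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
print(cluster_by_similarity(embeddings, 0.9))  # [[0, 1], [2, 3]]
```

Each returned group holds the indices of the updated subgraphs that would form one subgraph set, on which frequent subgraph mining would then be run separately.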
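Optionally, that the network device determines fault propagation conditions according to the multiple updated subgraphs includes:
The network device extracts the fault propagation conditions from the multiple updated subgraphs according to a frequent subgraph mining algorithm.
Optionally, after the network device determines the fault propagation conditions according to the multiple updated subgraphs, the method further includes:
The network device determines the fault propagation time corresponding to the fault propagation condition;
The method further includes:
The network device selects, from the fault propagation conditions, the fault propagation conditions that meet the condition, according to the object on which a fault alarm currently occurs, the updated subgraph of the communication network at the current time, and the fault propagation time corresponding to the fault propagation conditions;
When the number of fault propagation conditions that meet the condition is 1, the network device determines the start point of that fault propagation condition as the fault source of the currently occurring fault alarm.
Optionally, that the network device determines the fault propagation time corresponding to the fault propagation condition includes:
The network device determines the alarm occurrence time of the start point and the alarm occurrence time of the end point of a first fault propagation condition, where the first fault propagation condition is a fault propagation condition extracted from a first subgraph set, and the multiple subgraph sets include the first subgraph set;
The network device determines the difference between the alarm occurrence time of the start point of the first fault propagation condition and the alarm occurrence time of the end point as the fault propagation time corresponding to the first fault propagation condition.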
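Optionally, that the network device selects, from the fault propagation conditions, the fault propagation conditions that meet the condition, according to the object on which a fault alarm currently occurs, the updated subgraph of the communication network at the current time, and the fault propagation time corresponding to the fault propagation conditions, includes:
The network device selects, from the fault propagation conditions, second fault propagation conditions, whose end point is the object on which the fault alarm currently occurs and which can be matched with the updated subgraph of the communication network at the current time;
The network device selects, according to the updated subgraph of the communication network at the current time, third fault propagation conditions from the second fault propagation conditions, where the start point of a third fault propagation condition has had a fault alarm before the current time;
The network device determines, according to the updated subgraph of the communication network at the current time, the propagation time of the current alarm corresponding to the third fault propagation condition, where the propagation time of the current alarm is the difference between the alarm occurrence time of the start point of the third fault propagation condition and the alarm occurrence time of the currently occurring fault alarm, and the alarm occurrence time of the start point of the third fault propagation condition is determined from the updated subgraph of the communication network at the current time;
The network device selects, from the third fault propagation conditions, the fault propagation conditions for which the difference between the corresponding propagation time of the current alarm and the fault propagation time is less than a time threshold, and uses the selected fault propagation conditions as the fault propagation conditions that meet the condition.
When the difference between the propagation time of the current alarm corresponding to a third fault propagation condition and its fault propagation time is less than the time threshold, the currently occurring fault alarm is relatively likely to be the same as the fault alarm corresponding to the third fault propagation condition. Therefore, the selected fault propagation condition can be used as a fault propagation condition that meets the condition.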
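Optionally, the method further includes:
The network device determines the probability of occurrence of the fault propagation condition;
When the number of fault propagation conditions that meet the condition is greater than 1, the network device determines the start point of the fault propagation condition with the highest probability of occurrence among them as the fault source of the currently occurring fault alarm.
Because each fault propagation condition corresponds to a probability and there is usually only one fault source, when the number of fault propagation conditions that meet the condition is greater than 1, the fault propagation condition with the highest probability can be selected from them, and its start point is determined as the fault source of the currently occurring fault alarm.
Optionally, the fault propagation conditions are extracted by the network device, according to a frequent subgraph mining algorithm, from the updated subgraphs included in each of multiple subgraph sets;
That the network device determines the probability of occurrence of the fault propagation condition includes:
The network device determines the number of updated subgraphs in a first subgraph set in which a first fault propagation condition appears, where the first fault propagation condition is a fault propagation condition extracted from the first subgraph set, and the multiple subgraph sets include the first subgraph set;
The network device determines the probability of occurrence of the first fault propagation condition according to the ratio of that number to the total number of updated subgraphs in the first subgraph set.
Optionally, that the network device determines the probability of occurrence of the fault propagation condition includes:
The network device determines the number of times a first fault propagation condition appears in the multiple updated subgraphs to obtain a first number, where the fault propagation conditions include the first fault propagation condition;
The network device determines the number of times the connection relationship between the start point of the first fault propagation condition and a second event appears in the multiple updated subgraphs to obtain a second number, where the events include the second event, and the second event is the event corresponding to the first fault propagation condition;
The network device determines the probability of occurrence of the first fault propagation condition according to the ratio of the first number to the second number.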
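In the foregoing, the fault propagation conditions that meet the condition are first determined according to the fault propagation time, and the fault source of the currently occurring fault alarm is then determined according to the probability. Alternatively, the fault propagation conditions that meet the condition may first be determined according to the probability, and the fault source of the currently occurring fault alarm then determined according to the fault propagation time.
That is, the network device selects, from the extracted fault propagation conditions, the second fault propagation conditions, whose end point is the object on which the fault alarm currently occurs and which can be matched with the updated subgraph of the communication network at the current time. According to the updated subgraph of the communication network at the current time, it selects from the second fault propagation conditions the third fault propagation conditions, whose start point has had a fault alarm before the current time. From the third fault propagation conditions, it selects the fault propagation conditions whose probability is greater than a probability threshold and uses them as the fault propagation conditions that meet the condition. When the number of fault propagation conditions that meet the condition is 1, the network device determines the start point of that condition as the fault source of the currently occurring fault alarm. When the number is greater than 1, the network device determines, according to the updated subgraph of the communication network at the current time, the propagation time of the current alarm corresponding to each fault propagation condition that meets the condition, and determines the start point of the condition with the smallest difference between its propagation time of the current alarm and its fault propagation time as the fault source of the currently occurring fault alarm.
Whether the fault source of the currently occurring fault alarm is determined first by fault propagation time and then by probability, or first by probability and then by fault propagation time, the network device needs to determine, after extracting the fault propagation conditions, both the probability of occurrence of each fault propagation condition and the corresponding fault propagation time. However, the network device may also determine only the fault propagation time corresponding to the fault propagation conditions, or only their probability of occurrence. In that case, the network device may determine the fault source of the currently occurring fault alarm based only on the fault propagation time, or based only on the probability.
The network device may determine the fault source of the currently occurring fault alarm based only on the fault propagation time as follows: from the extracted fault propagation conditions, select the second fault propagation conditions, whose end point is the object on which the fault alarm currently occurs and which can be matched with the updated subgraph of the communication network at the current time. According to the updated subgraph of the communication network at the current time, select from the second fault propagation conditions the third fault propagation conditions, whose start point has had a fault alarm before the current time. According to the updated subgraph of the communication network at the current time, determine the propagation time of the current alarm corresponding to each third fault propagation condition. From the third fault propagation conditions, select the fault propagation conditions for which the difference between the corresponding propagation time of the current alarm and the fault propagation time is less than a time threshold, and use them as the fault propagation conditions that meet the condition. When the number of fault propagation conditions that meet the condition is 1, the network device determines the start point of that condition as the fault source of the currently occurring fault alarm. When the number is greater than 1, the network device determines the start point of the condition with the smallest difference between its propagation time of the current alarm and its fault propagation time as the fault source of the currently occurring fault alarm.
The network device may determine the fault source of the currently occurring fault alarm based only on the probability as follows: from the extracted fault propagation conditions, select the second fault propagation conditions, whose end point is the object on which the fault alarm currently occurs and which can be matched with the updated subgraph of the communication network at the current time. According to the updated subgraph of the communication network at the current time, select from the second fault propagation conditions the third fault propagation conditions, whose start point has had a fault alarm before the current time. According to the updated subgraph of the communication network at the current time, determine the propagation time of the current alarm corresponding to each third fault propagation condition. From the third fault propagation conditions, select the fault propagation conditions whose probability is greater than a probability threshold, and use them as the fault propagation conditions that meet the condition. When the number of fault propagation conditions that meet the condition is 1, the network device determines the start point of that condition as the fault source of the currently occurring fault alarm. When the number is greater than 1, the network device determines the start point of the condition with the highest probability among them as the fault source of the currently occurring fault alarm.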
可选地,在所述网络设备根据所述多个更新子图确定故障传播条件之后,所述方法还包括:
所述网络设备根据当前发生故障告警的对象、当前时间所述通信网络的更新子图和所述故障传播条件,预测故障受影响对象,所述故障受影响对象是指受当前发生的故障告警的影响而发生故障告警的对象。
可选地,所述网络设备根据当前发生故障告警的对象和所述故障传播条件,预测故障受影响对象,包括:
所述网络设备从所述故障传播条件中,选择起点为当前发生故障告警的对象且能够与当前时间所述通信网络的更新子图匹配的第四故障传播条件;
所述网络设备将所述第四故障传播条件的终点确定为所述故障受影响对象。
可选地,所述方法还包括:
所述网络设备确定所述故障传播条件对应的故障传播时间;
所述网络设备根据所述第四故障传播条件对应的故障传播时间和当前发生的故障告警的告警发生时间,预测所述故障受影响对象发生故障告警的时间。
第二方面,提供了一种提取故障传播条件的装置,所述提取故障传播条件的装置具有实现上述第一方面中提取故障传播条件的方法行为的功能。所述提取故障传播条件的装置包括至少一个模块,该至少一个模块用于实现上述第一方面所提供的提取故障传播条件的方法。
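In a third aspect, a network device is provided. The network device includes a processor and a memory. The memory is configured to store a program for performing the method for extracting a fault propagation condition provided in the first aspect, and to store data involved in implementing that method. The processor is configured to execute the program stored in the memory. The network device may further include a communication bus, which is used to establish a connection between the processor and the memory.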
第四方面,提供了一种网络设备,所述网络设备包括处理器和网络接口,所述网络接口用于获取实现上述第一方面任一所述的方法所涉及的数据,所述处理器用于根据所述网络接口获取的数据,执行上述第一方面所述的方法的步骤。
第五方面,提供了一种计算机可读存储介质,所述计算机可读存储介质中存储有指令,当其在计算机上运行时,使得计算机执行上述第一方面所述的提取故障传播条件的方法。
第六方面,提供了一种包含指令的计算机程序产品,当其在计算机上运行时,使得计算机执行上述第一方面所述的提取故障传播条件的方法。
上述第二方面、第三方面、第四方面、第五方面和第六方面所获得的技术效果与第一方面中对应的技术手段获得的技术效果近似,在这里不再赘述。
本申请提供的技术方案至少可以带来以下有益效果:在本申请中,网络设备可以通过多个不同时间一一对应的多个事件对象连接图提取故障传播条件,无需人工总结故障传播条件,可以减少人工成本,进而可以提高故障传播条件的提取效率。而且,这多个不同时间内通信网络发生的故障基本可以覆盖所有的故障类型,从而保证提取出的故障传播条件的故障覆盖率较高,且具备可复制性和可扩展性,可以大量推广。
附图说明
图1是本申请实施例提供的一种数据中心网络的架构图;
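Fig. 2 is an architecture diagram of a system for extracting a fault propagation condition according to an embodiment of this application;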
图3是本申请实施例提供的一种计算机设备的结构示意图;
图4是本申请实施例提供的一种提取故障传播条件的方法的流程图;
图5是本申请实施例提供的一种对象之间的跳数为1的示意图;
图6是本申请实施例提供的一种对象之间的跳数为2的示意图;
图7是本申请实施例提供的一种更新子图的示意图;
图8是本申请实施例提供的一种确定故障源的方法的流程图;
图9是本申请实施例提供的一种预测故障传播范围的方法的流程图;
图10是本申请实施例提供的一种提取故障传播条件的装置的结构示意图。
具体实施方式
为使本申请的目的、技术方案和优点更加清楚,下面将结合附图对本申请实施方式作进一步地详细描述。
本申请实施例提供的方法可以应用于各种通信网络中,比如,数据中心网络、移动通讯网络等。这些通信网络中的设备可以与网络设备连接,进而通过网络设备提取能够对这些通信网络中发生的故障进行定位的故障传播条件。也即是,用于提取故障传播条件的网络设备可以是独立于通信网络之外的设备。当然,用于提取故障传播条件的网络设备也可以为通信网络中的设备,也即是,通过通信网络中的设备也可以提取能够对通信网络中发生的故障进行定位的故障传播条件。
请参考图1,图1是根据本申请实施例示出的一种数据中心网络的架构图。该数据中心网络包括多个计算机节点101、多个隧道端点102和多个中间节点103。一个计算机节点101与一个隧道端点102之间建立有通信连接,每个隧道端点102与每个中间节点103之间建立有通信连接。可选地,为了提高计算机节点101与隧道端点102之间的通信可靠性,一个计算机节点101也可以与两个或者两个以上的隧道端点102建立通信连接,此时,这两个或者两个以上的隧道端点102可以互为备份节点。该多个计算机节点101可以为服务器、防火墙、负载均衡器等等,服务器可以为虚拟机,也可以为裸机,即不包括操作系统的机器。
对于图1所示的数据中心网络,隧道端点102或者中间节点103可以作为提取故障传播条件的网络设备,也即是,隧道端点102或中间节点103可以获取数据中心网络中发生的事件、以及与这些事件相关的对象之间的连接关系,进而生成事件对象连接图,从而提取故障传播条件。示例性地,当该数据中心网络的结构为脊叶(Spine-Leaf)结构时,隧道端点102可以为Leaf节点,中间节点103可以为Spine节点。也即是,Leaf节点和Spine节点均可以作为提取故障传播条件的网络设备。
可选地,请参考图2,该数据中心网络还连接有网络设备104。在一些实施例中,网络设备104可以与每个计算机节点101、每个隧道端点102和每个中间节点103之间建立有通信连接。在另一些实施例中,由于计算机节点101、隧道端点102和中间节点103之间建立有通信连接,因此,网络设备104可以仅与中间节点103建立通信连接。其中,图2中以网络设备104与中间节点103建立通信连接为例。在这种情况下,网络设备104可以通过与连接的设备之间的交互,获取数据中心网络中发生的事件、以及与这些事件相关的对象之间的连接关系,进而生成事件对象连接图,从而提取故障传播条件。
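It should be noted that, because data in the data center network is transmitted through tunnels, a tunnel endpoint 102 may be the ingress endpoint of a tunnel or the egress endpoint of a tunnel, and an intermediate node 103 may be a network node that a tunnel passes through.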
请参考图3,图3是根据本申请实施例示出的一种计算机设备的结构示意图,该计算机设备可以是图1和图2部分描述的内容中涉及的任一设备,比如,计算机节点101、隧道端点102、中间节点103、网络设备104等。该计算机设备包括至少一个处理器301、通信总线302、存储器303以及至少一个通信接口304。
处理器301可以是一个通用中央处理器(central processing unit,CPU)、网络处理器(NP)、微处理器、或者可以是一个或多个用于实现本申请方案的集成电路,例如,专用集成电路(application-specific integrated circuit,ASIC),可编程逻辑器件(programmable logic device,PLD)或其组合。上述PLD可以是复杂可编程逻辑器件(complex programmable logic device,CPLD),现场可编程逻辑门阵列(field-programmable gate array,FPGA),通用阵列逻辑(generic array logic,GAL)或其任意组合。
通信总线302用于在上述组件之间传送信息。通信总线302可以分为地址总线、数据总线、控制总线等。为便于表示,图中仅用一条粗线表示,但并不表示仅有一根总线或一种类型的总线。
存储器303可以是只读存储器(read-only memory,ROM)或可存储静态信息和指令的其它类型的静态存储设备,也可以是随机存取存储器(random access memory,RAM)或者可存储信息和指令的其它类型的动态存储设备,也可以是电可擦可编程只读存储器(electrically erasable programmable read-only Memory,EEPROM)、只读光盘(compact disc read-only memory,CD-ROM)或其它光盘存储、光碟存储(包括压缩光碟、激光碟、光碟、数字通用光碟、蓝光光碟等)、磁盘存储介质或者其它磁存储设备,或者是能够用于携带或存储具有指令或数据结构形式的期望的程序代码并能够由计算机存取的任何其它介质,但不限于此。存储器303可以是独立存在,并通过通信总线302与处理器301相连接。存储器303也可以和处理器301集成在一起。
通信接口304使用任何收发器一类的装置,用于与其它设备或通信网络通信。通信接口304包括有线通信接口,还可以包括无线通信接口。其中,有线通信接口例如可以为以太网接口。以太网接口可以是光接口,电接口或其组合。无线通信接口可以为无线局域网(wireless local area networks,WLAN)接口,蜂窝网络通信接口或其组合等。
在具体实现中,作为一种实施例,处理器301可以包括一个或多个CPU,如图3中所示的CPU0和CPU1。
在具体实现中,作为一种实施例,计算机设备可以包括多个处理器,如图3中所示的处理器301和处理器305。这些处理器中的每一个可以是一个单核处理器(single-CPU),也可以是一个多核处理器(multi-CPU)。这里的处理器可以指一个或多个设备、电路、和/或用于处理数据(如计算机程序指令)的处理核。
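In a specific implementation, as an embodiment, the computer device may further include an output device 306 and an input device 307. The output device 306 communicates with the processor 301 and can display information in a variety of ways. For example, the output device 306 may be a liquid crystal display (LCD), a light-emitting diode (LED) display device, a cathode ray tube (CRT) display device, a projector, or the like. The input device 307 communicates with the processor 301 and can receive user input in a variety of ways. For example, the input device 307 may be a mouse, a keyboard, a touchscreen device, a sensing device, or the like.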
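In some embodiments, the memory 303 is configured to store program code 310 for executing the solution of this application, and the processor 301 can execute the program code 310 stored in the memory 303. That is, the computer device can implement, by means of the processor 301 and the program code 310 in the memory 303, the methods provided in the embodiments of Fig. 4, Fig. 8 and Fig. 9 below.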
请参考图4,图4是本申请实施例示出的一种提取故障传播条件的方法的流程图,该方法包括如下几个步骤。
步骤401:网络设备在多个不同时间获取通信网络对应的多个事件对象连接图,多个不同时间与多个事件对象连接图一一对应,多个事件对象连接图中的每个事件对象连接图用于描述通信网络中发生的、与故障相关的事件,以及与该事件相关的对象之间的连接关系。
由于通信网络中经常会发生不同种类的故障,且不同的故障可能因不同的原因所产生,比如,有的故障是因物理设备的硬件原因所产生,有的故障是因物理设备上部署的协议所产生,因此,在通信网络发生与故障相关的事件时,与该事件相关的对象可能是物理设备、单板、物理端口这些物理节点,也可能是诸如开放最短路径优先(open shortest-path first,OSPF)、边界网关协议(border gateway protocol,BGP)等协议相关的逻辑节点,还有可能是L3link、告警、日志等虚拟节点。另外,通信网络在不同时间可能会发生不同的故障,当发生的故障不同时,与该故障相关的事件就会不同,进而与该事件相关的对象也会不同,因此,通信网络的事件对象连接图可能随着时间的变化而变化,所以,可以在多个不同时间获取通信网络对应的多个事件对象连接图。
在一些实施例中,网络设备可以获取通信网络中发生的故障告警,以及通信网络运行过程中的日志,从日志中提取与故障相关的事件,以及与该事件相关的对象,进而按照提取出的事件,以及与该事件相关的对象之间的关系,生成事件对象连接图。具体实现过程可以参考相关技术。
需要说明的是,上述多个不同时间可以是指多个不同的时刻,也可以是指多个不同的时间段,当然,上述多个不同时间也可以既包括时刻,又包括时间段。也即是,上述多个事件对象连接图可以全部为不同时刻对应的事件对象连接图,也可以全部为不同时间段对应的事件对象连接图,还可以是一部分为不同时刻对应的事件对象连接图,另一部分为不同时间段对应的事件对象连接图。
另外,在本申请实施例中,事件对象连接图可以以图形的形式表示,也可以是用其他形式表示,比如,可以以表项的形式来表示,本申请实施例对事件对象连接图的表示形式不作限定。
步骤402:网络设备根据多个事件对象连接图确定多个子图,多个子图与多个事件对象连接图一一对应,多个子图中的每个子图是对应的事件对象连接图的子集,多个子图中的每个子图中产生第一事件的对象与第一事件相关的任意对象之间的跳数不大于N,与故障相关的事件包括第一事件,N为大于或等于1的整数。
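In some embodiments, for each of the multiple event object connection graphs, the network device may obtain from that event object connection graph the connection relationships in which the number of hops between the object that generates the first event and any object related to the first event is less than or equal to N. Because the first event is any event related to a fault, once the connection relationships in which the number of hops between the object generating each event and any object related to that event is less than or equal to N have been obtained, the subgraph corresponding to that event object connection graph is obtained.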
在另一些实施例中,对于该多个事件对象连接图中的每个事件对象连接图,网络设备可以从该事件对象连接图中,获取产生第一事件的对象与第一事件相关的任意对象之间的跳数等于N的连接关系。由于第一事件为与故障相关的任一事件,因此,当获取到产生每个事件的对象与每个事件相关的任意对象之间的跳数等于N的连接关系之后,即可得到该事件对象连接图对应的子图。
需要说明的是,上述多个子图中的每个子图中,产生与故障相关的事件的两个对象所连接的路径上不含有发生故障告警的对象。比如,如图5所示,产生与故障相关的事件的两个对象均为OsNetwork,这两个对象所连接的路径上不含有发生故障告警的对象,也即是,这两个对象所连接的路径上不存在与故障相关的事件,而且这两个对象直接连接,因此,这两个对象之间的跳数等于1。如图6所示,产生与故障相关的事件的两个对象为BGP Peer和OsNetwork,这两个对象所连接的路径上不含有发生故障告警的对象,也即是,这两个对象所连接的路径上不存在与故障相关的事件,且这两个对象之间是通过L3link连接的,因此,这两个对象之间的跳数等于2。
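The hop-limited extraction of step 402 can be sketched as a breadth-first search that stops N hops away from each event-generating object. This is a minimal sketch under assumptions that are not in the source: the event object connection graph is given as an adjacency dictionary, the function name `n_hop_subgraph` is hypothetical, and the rule that a path must not contain another alarming object is not modelled.

```python
from collections import deque

def n_hop_subgraph(adjacency, event_objects, max_hops):
    """Collect the edges that lie within max_hops of any event-generating object."""
    kept_edges = set()
    for source in event_objects:
        hops = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            if hops[node] == max_hops:
                continue  # do not expand beyond the hop limit
            for neighbour in adjacency.get(node, ()):
                kept_edges.add(frozenset((node, neighbour)))
                if neighbour not in hops:
                    hops[neighbour] = hops[node] + 1
                    queue.append(neighbour)
    return kept_edges

# Topology loosely modelled on the BGP Peer / L3link / OsNetwork example.
adjacency = {
    "BGP Peer": ["L3link"],
    "L3link": ["BGP Peer", "OsNetwork"],
    "OsNetwork": ["L3link", "OsRouter"],
    "OsRouter": ["OsNetwork"],
}
subgraph = n_hop_subgraph(adjacency, ["BGP Peer"], max_hops=2)
```

With `max_hops=2` the sketch keeps the BGP Peer to L3link and L3link to OsNetwork edges and drops the edge to OsRouter, which is three hops away.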
步骤403:网络设备根据对象与对象类型之间的对应关系,将多个子图中的每个子图中的对象更新为对应的对象类型,得到多个更新子图,多个更新子图与多个子图一一对应。
在一些实施例中,对于多个子图中的每个子图,网络设备可以从对象与对象类型之间的对应关系中,获取该子图中的对象对应的对象类型,将该子图中的对象用对应的对象类型进行替换,从而得到一个更新子图。
示例性地,对象与对象类型之间的对应关系可以如下表1所示,通过下述表1,将一个子图中的对象更新为对应的对象类型之后,可以得到如图7所示的更新子图。
Table 1
Object                  Object type
Alarm, log              Alarm
OSPF network segment    OsNetwork
OSPF router             OsRouter
BGP neighbor            BGP Peer
VXLAN tunnel table      Tunnel
......                  ......
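It should be noted that Table 1 is only an exemplary correspondence provided in the embodiments of this application and does not limit the embodiments of this application.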
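Step 403 amounts to relabelling every object in a subgraph with its object type. The sketch below assumes a subgraph is a list of object pairs; the instance names in `OBJECT_TYPES` and the function name `to_updated_subgraph` are hypothetical, and only the type names on the right-hand side come from Table 1.

```python
# Hypothetical object instances mapped to the object types listed in Table 1.
OBJECT_TYPES = {
    "alarm-0017": "Alarm",
    "ospf-net-10.0.0.0/24": "OsNetwork",
    "ospf-router-1.1.1.1": "OsRouter",
    "bgp-peer-2.2.2.2": "BGP Peer",
    "vxlan-tunnel-5": "Tunnel",
}

def to_updated_subgraph(edges, object_types):
    """Replace both end points of every edge with the corresponding object type."""
    return [(object_types[a], object_types[b]) for a, b in edges]

subgraph = [("alarm-0017", "bgp-peer-2.2.2.2"),
            ("ospf-net-10.0.0.0/24", "ospf-router-1.1.1.1")]
updated = to_updated_subgraph(subgraph, OBJECT_TYPES)
```

Working on types rather than instances is what lets the same propagation pattern be recognised across different devices and different times.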
步骤404:网络设备根据多个更新子图确定故障传播条件,故障传播条件用于指示故障在通信网络中被传播的路径。
在一些实施例中,网络设备可以根据图嵌入算法,将多个更新子图分别转换为图嵌入向量,得到与多个更新子图一一对应的多个图嵌入向量。根据多个图嵌入向量和聚类算法确定多个子图集合,多个子图集合中的每个子图集合包括多个更新子图中的至少一个更新子图。根据频繁子图挖掘算法,从多个子图集合中的每个子图集合包括的更新子图中提取故障传播条件。
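As an example, the network device may determine the multiple subgraph sets from the multiple graph embedding vectors and the clustering algorithm as follows: determine the similarity between every two of the multiple graph embedding vectors, then cluster the multiple updated subgraphs according to the determined similarities and the clustering algorithm to obtain the multiple subgraph sets.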
由于图嵌入向量可以代表更新子图,因此,网络设备根据多个图嵌入向量中每两个图嵌入向量之间的相似度,按照聚类算法可以对多个更新子图进行聚类,得到多个子图集合。
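The pairwise similarity step can be illustrated as follows. The source does not name a similarity measure, so cosine similarity is an assumption here, as are the function names; the embedding vectors themselves would come from a graph embedding algorithm such as graph2vec, which is not reproduced.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def pairwise_similarity(vectors):
    """Similarity between every two graph embedding vectors, as a square matrix."""
    n = len(vectors)
    return [[cosine_similarity(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]

# Three made-up embeddings: the first two point the same way, the third is orthogonal.
similarities = pairwise_similarity([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
```

A matrix of this kind is the sort of input that a similarity-based clustering algorithm such as AP (affinity propagation) consumes; updated subgraphs with highly similar embeddings would end up in the same subgraph set.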
在另一些实施例中,网络设备可以根据频繁子图挖掘算法,从多个更新子图中提取故障传播条件。也即是,网络设备不用进行图嵌入向量的转换,也不需要进行更新子图的聚类,而是直接根据频繁子图挖掘算法,从多个更新子图中提取故障传播条件。当然,本申请实施例是以频繁子图挖掘算法为例进行说明,网络设备也可以按照其他的算法,从多个更新子图中提取故障传播条件,本申请实施例在此不再一一列举。
值得注意的是,网络设备根据频繁子图挖掘算法提取出的故障传播条件的数量可以为0,也可以为1,当然,也可以大于1。而且,有的更新子图中可能提取不出故障传播条件,有的更新子图中可以提取出数量大于或等于1的故障传播条件,且两个或者两个以上的更新子图中也可能会提取出相同的故障传播条件。
需要说明的是,图嵌入算法可以为graph2vec、GNN图神经网络等算法,聚类算法可以为Kmeans、AP等算法,频繁子图挖掘算法可以为gSpan、CloseGraph等算法,本申请实施例对此不做限定。
另外,故障传播条件可以以文本的形式来表示,也可以以图形的形式来表示。比如,对于文本形式的故障传播条件“OsNetwork-L3link-BGPpeer”,该故障传播条件用于指示OSPF网段(OsNetwork)内邻居协议状态故障导致BGP Loopback口IP不可达(L3link),最终导致BGP邻居断链(BGP Peer)。
进一步地,在网络设备根据多个更新子图确定故障传播条件之后,还可以确定提取出的故障传播条件出现的概率和/或故障传播时间。也即是,网络设备可以确定提取出的故障传播条件出现的概率,也可以确定提取出的故障传播条件对应的故障传播时间,还可以确定提取出的故障传播条件出现的概率以及对应的故障传播时间。
在一些实施例中,网络设备确定提取出的故障传播条件对应的故障传播时间的实现过程可以为:网络设备确定第一故障传播条件的起点的告警发生时间和终点的告警发生时间,第一故障传播条件为第一子图集合中提取出的故障传播条件,多个子图集合包括第一子图集合。网络设备将第一故障传播条件的起点的告警发生时间与终点的告警发生时间之间的差值,确定为第一故障传播条件对应的故障传播时间。
基于上述描述,事件对象连接图中包括与故障相关的事件,且与故障相关的事件会产生故障告警,且故障告警通常伴随有告警发生时间,本申请实施例中,事件对象连接图中的、与故障相关的事件中可以携带告警发生时间,进而更新子图中可以携带告警发生时间。因此,网络设备确定第一故障传播条件的起点的告警发生时间和终点的告警发生时间的实现过程可以为:从第一子图集合中确定出现第一故障传播条件的更新子图,从确定的更新子图中获取第一故障传播条件的起点所连接的事件携带的告警发生时间,以及终点所连接的事件携带的告警发生时间。将这些起点所连接的事件携带的告警发生时间的平均值确定为第一故障传播条件的起点的告警发生时间,将这些终点所连接的事件携带的告警发生时间的平均值确定为第一故障传播条件的终点的告警发生时间。
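Alternatively, the network device may determine, from the first subgraph set, the updated subgraphs in which the first fault propagation condition occurs, and obtain from each of these updated subgraphs the alarm occurrence time carried by the event connected to the start point of the first fault propagation condition and the alarm occurrence time carried by the event connected to its end point. For each of these updated subgraphs it determines the difference between the two alarm occurrence times, and it determines the average of these differences as the fault propagation time corresponding to the first fault propagation condition.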
由于第一子图集合为多个子图集合中的一个子图集合,第一故障传播条件是第一子图集合中提取出的一个故障传播条件,因此,按照上述方法可以确定出每个子图集合中提取出的每个故障传播条件对应的故障传播时间。
比如,网络设备提取出3个故障传播条件,分别为故障传播条件1、故障传播条件2和故障传播条件3。故障传播条件1的起点的告警发生时间为10点20分21秒,终点的告警发生时间为10点21分,那么,故障传播条件1对应的故障传播时间为39秒。同理,故障传播条件2的起点的告警发生时间为10点23分02秒,终点的告警发生时间为10点24分20秒,那么,故障传播条件2对应的故障传播时间为1分18秒。故障传播条件3的起点的告警发生时间为10点22分10秒,终点的告警发生时间为10点22分59秒,那么,故障传播条件3对应的故障传播时间为49秒。
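The propagation-time calculation can be sketched with the standard `datetime` module. The function name and the input shape (one start/end pair of alarm occurrence times per updated subgraph in which the condition occurs) are assumptions. Averaging the per-subgraph differences gives the same value as subtracting the averaged start time from the averaged end time, so this one function covers both variants described above.

```python
from datetime import datetime, timedelta

def fault_propagation_time(alarm_pairs):
    """Average (end - start) over the updated subgraphs in which the condition occurs."""
    differences = [end - start for start, end in alarm_pairs]
    return sum(differences, timedelta()) / len(differences)

# Fault propagation condition 1 from the example above; the calendar date is arbitrary.
day = (2020, 9, 16)
pairs = [(datetime(*day, 10, 20, 21), datetime(*day, 10, 21, 0))]
print(fault_propagation_time(pairs).total_seconds())  # 39.0
```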
在一些实施例中,网络设备确定故障传播条件出现的概率的实现过程可以为:网络设备确定第一子图集合中出现第一故障传播条件的更新子图的个数,第一故障传播条件为第一子图集合中提取出的故障传播条件,多个子图集合包括第一子图集合。网络设备根据确定的个数与第一子图集合中的更新子图的总数之间的比值,确定第一故障传播条件出现的概率。
由于第一子图集合为多个子图集合中的一个子图集合,第一故障传播条件是第一子图集合中提取出的一个故障传播条件,因此按照上述方法可以确定出每个子图集合中提取出的每个故障传播条件出现的概率。
作为一种示例,网络设备可以直接将确定的个数与第一子图集合中的更新子图的总数之间的比值确定为第一故障传播条件出现的概率。
比如,网络设备从第一子图集合中提取出故障传播条件1,且第一子图集合中出现故障传播条件1的更新子图的个数为20个,第一子图集合中的更新子图的总数为30个,那么,故障传播条件1出现的概率可以为67%。
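The ratio in this example can be reproduced with a small sketch. It assumes each updated subgraph is a set of undirected edges between object types and that a fault propagation condition is a tuple of object types; the containment test is a deliberate simplification of real subgraph matching, and all names are hypothetical.

```python
def contains_condition(edges, path):
    """True if every consecutive pair of object types in the path is an edge."""
    return all(frozenset(pair) in edges for pair in zip(path, path[1:]))

def occurrence_probability(subgraph_set, path):
    """Share of updated subgraphs in the set in which the condition occurs."""
    hits = sum(1 for edges in subgraph_set if contains_condition(edges, path))
    return hits / len(subgraph_set)

path = ("OsNetwork", "L3link", "BGP Peer")
with_path = {frozenset(("OsNetwork", "L3link")), frozenset(("L3link", "BGP Peer"))}
without_path = {frozenset(("OsNetwork", "L3link"))}
subgraph_set = [with_path] * 20 + [without_path] * 10
probability = occurrence_probability(subgraph_set, path)
```

Here `probability` is 20/30, roughly 0.67, in line with the 67% stated above.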
在另一些实施例中,网络设备可以确定第一故障传播条件在多个更新子图中出现的次数,得到第一次数,提取出的故障传播条件包括第一故障传播条件。确定第一故障传播条件的起点与第二事件之间的连接关系在多个更新子图中出现的次数,得到第二次数,与故障相关的事件包括第二事件,且第二事件为第一故障传播条件对应的事件。根据第一次数与第二次数之间的比值,确定第一故障传播条件出现的概率。
需要说明的是,第一故障传播条件的起点可能连接有多个事件,也即是,第一故障传播条件的起点是产生该多个事件的对象。而事件对象连接图或者更新子图中,与该多个事件相关的对象可能并不完全相同,这样,从第一故障传播条件的起点开始,可能会经由不同的路径到达不同的终点。但是,每条路径对应一个故障传播条件,也对应一个事件,因此,第一故障传播条件会对应一个事件,且第一故障传播条件对应的事件可以是指第一故障传播条件的起点所产生的事件。
作为一种示例,网络设备可以直接将第一次数与第二次数之间的比值确定为第一故障传播条件出现的概率。
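One possible reading of this second method is sketched below. The representation is an assumption: each updated subgraph is a dictionary with an `edges` set and an `events` set of (object type, event name) pairs, and the event name `neighbor_down` is invented for illustration. The first count is the number of updated subgraphs in which the whole condition occurs; the second count is the number in which the start point is connected to the event.

```python
def ratio_probability(updated_subgraphs, path, start_event):
    """Ratio of the first count (condition occurs) to the second count (start event occurs)."""
    first_count = 0
    second_count = 0
    for graph in updated_subgraphs:
        if all(frozenset(pair) in graph["edges"] for pair in zip(path, path[1:])):
            first_count += 1
        if (path[0], start_event) in graph["events"]:
            second_count += 1
    return first_count / second_count if second_count else 0.0

path = ("OsNetwork", "L3link", "BGP Peer")
full = {"edges": {frozenset(("OsNetwork", "L3link")), frozenset(("L3link", "BGP Peer"))},
        "events": {("OsNetwork", "neighbor_down")}}
start_only = {"edges": {frozenset(("OsNetwork", "OsRouter"))},
              "events": {("OsNetwork", "neighbor_down")}}
probability = ratio_probability([full, full, full, start_only], path, "neighbor_down")
```

In this made-up data the event at the start point is seen four times and leads to the full path three times, so the ratio is 0.75.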
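In the embodiments of this application, the network device can extract fault propagation conditions from multiple event object connection graphs that correspond one-to-one to multiple different times, so fault propagation conditions do not need to be summarized manually. This reduces labor cost and improves the efficiency of extracting fault propagation conditions. Moreover, the faults that occur in the communication network over these multiple different times can cover essentially all fault types, which ensures that the extracted fault propagation conditions have high fault coverage; the approach is also replicable and scalable and can be widely deployed.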
请参考图8,图8是本申请实施例示出的一种确定故障源的方法的流程图,该方法包括如下几个步骤。
本申请实施例提供的确定故障源的方法可以在图4所示实施例的基础上实现,也即是,在网络设备按照图4所示实施例提取出故障传播条件,并确定故障传播条件出现的概率,以及故障传播条件对应的故障传播时间之后,可以按照下述步骤801-步骤803的方法确定故障源。
步骤801:网络设备根据当前发生故障告警的对象、当前时间通信网络的更新子图和故障传播条件对应的故障传播时间,从提取的故障传播条件中筛选满足条件的故障传播条件。
在一些实施例中,网络设备可以按照下述步骤(1)-(4)确定满足条件的故障传播条件。
(1)从提取的故障传播条件中,选择终点为当前发生故障告警的对象且能够与当前时间通信网络的更新子图匹配的第二故障传播条件。
作为一种示例,网络设备可以从提取的故障传播条件中,选择终点为当前发生故障告警的对象的故障传播条件。从选择的故障传播条件中,筛选出所指示的路径存在于当前时间通信网络的更新子图中的故障传播条件,将筛选出的故障传播条件作为终点为当前发生故障告警的对象且能够与当前时间通信网络的更新子图匹配的第二故障传播条件。
基于上述步骤401的描述,通信网络在不同时间可能对应不同的事件对象连接图,因此,网络设备可以根据当前时间通信网络的事件对象连接图确定当前时间通信网络的更新子图。也即是,网络设备在确定故障源时,可以确定当前时间通信网络的事件对象连接图,根据当前时间通信网络的事件对象连接图确定当前时间的子图,根据对象与对象类型之间的对应关系,将当前时间的子图中的对象更新为对应的对象类型,得到当前时间通信网络的更新子图。
(2)根据当前时间通信网络的更新子图,从第二故障传播条件中选择起点在当前时间之前发生过故障告警的第三故障传播条件。
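Based on the foregoing description, the alarm objects in the event object connection graph carry alarm occurrence times; therefore, after conversion into the updated subgraph, the alarm occurrence times can also be determined from the updated subgraph. Accordingly, in some embodiments, the network device may look up, in the updated subgraph of the communication network at the current time, whether the start point of each second fault propagation condition carries an alarm occurrence time, and determine the second fault propagation conditions whose start point carries an alarm occurrence time as the third fault propagation conditions.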
(3)根据当前时间通信网络的更新子图,确定第三故障传播条件对应的本次告警传播时间,本次告警传播时间是指第三故障传播条件的起点的告警发生时间与当前发生的故障告警的告警发生时间之间的差值,第三故障传播条件的起点的告警发生时间是从当前时间通信网络的更新子图中确定的。
(4)从第三故障传播条件中,选择对应的本次告警传播时间与故障传播时间之间的差值小于时间阈值的故障传播条件,将选择的故障传播条件作为满足条件的故障传播条件。
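When the difference between the current alarm propagation time corresponding to a third fault propagation condition and its fault propagation time is less than the time threshold, the currently occurring fault alarm is relatively likely to be the same as the fault alarm corresponding to that third fault propagation condition. The selected fault propagation condition can therefore be used as a fault propagation condition that satisfies the condition.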
需要说明的是,时间阈值可以根据使用需求设置,比如,2秒,本申请实施例对此不作限定。
步骤802:当满足条件的故障传播条件的数量为1,网络设备将满足条件的故障传播条件的起点确定为当前发生的故障告警的故障源。
步骤803:当满足条件的故障传播条件的数量大于1,网络设备将满足条件的故障传播条件中出现的概率最大的故障传播条件的起点确定为当前发生的故障告警的故障源。
由于每个故障传播条件对应一个概率,且故障源通常只有一个,因此,当满足条件的故障传播条件的数量大于1时,可以从满足条件的故障传播条件中选择概率最大的故障传播条件,将概率最大的故障传播条件的起点确定为当前发生的故障告警的故障源。
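Steps 801 to 803 can be put together in one function. This is a sketch under stated assumptions, not the claimed implementation: times are plain seconds, the current updated subgraph is a set of undirected type edges plus a dictionary of alarm occurrence times per object type, the occurrence time of the current alarm stands in for the "current time", and the names `Condition` and `locate_fault_source` are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Condition:
    path: tuple              # object types from start point to end point
    propagation_time: float  # learned fault propagation time, in seconds
    probability: float       # learned probability of occurrence

def path_in_graph(path, edges):
    """True if every consecutive pair of object types in the path is an edge."""
    return all(frozenset(pair) in edges for pair in zip(path, path[1:]))

def locate_fault_source(conditions, alarm_object, alarm_time, edges, alarm_times, threshold):
    # (1) End point is the alarming object and the path matches the current updated subgraph.
    second = [c for c in conditions
              if c.path[-1] == alarm_object and path_in_graph(c.path, edges)]
    # (2) The start point raised an alarm before the current alarm.
    third = [c for c in second
             if c.path[0] in alarm_times and alarm_times[c.path[0]] < alarm_time]
    # (3) and (4) Compare the current alarm propagation time with the learned one.
    qualifying = [c for c in third
                  if abs((alarm_time - alarm_times[c.path[0]]) - c.propagation_time) < threshold]
    if not qualifying:
        return None
    # Step 802 (single candidate) and step 803 (several: highest probability wins).
    return max(qualifying, key=lambda c: c.probability).path[0]

edges = {frozenset(p) for p in [("OsNetwork", "L3link"), ("OsRouter", "L3link"),
                                ("L3link", "BGP Peer")]}
conditions = [Condition(("OsNetwork", "L3link", "BGP Peer"), 39.0, 0.67),
              Condition(("OsRouter", "L3link", "BGP Peer"), 78.0, 0.80)]
source = locate_fault_source(conditions, "BGP Peer", 140.0, edges,
                             {"OsNetwork": 100.0, "OsRouter": 61.0}, threshold=2.0)
```

In this made-up scenario both conditions pass the time filter (each is off by one second), so the higher probability decides and `source` is `"OsRouter"`. Returning `None` when nothing qualifies is a choice made for the sketch; the source does not say what happens in that case.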
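Steps 801 to 803 above first determine the fault propagation conditions that satisfy the condition according to the fault propagation time, and then determine the fault source of the currently occurring fault alarm according to the probability. Of course, it is also possible to first determine the fault propagation conditions that satisfy the condition according to the probability, and then determine the fault source of the currently occurring fault alarm according to the fault propagation time. That is, from the extracted fault propagation conditions, the network device selects second fault propagation conditions whose end point is the object on which the fault alarm currently occurs and which can be matched against the updated subgraph of the communication network at the current time. Based on the updated subgraph of the communication network at the current time, it selects, from the second fault propagation conditions, third fault propagation conditions whose start point raised a fault alarm before the current time. From the third fault propagation conditions, it selects those whose probability is greater than a probability threshold and uses them as the fault propagation conditions that satisfy the condition. When exactly one fault propagation condition satisfies the condition, the network device determines the start point of that fault propagation condition as the fault source of the currently occurring fault alarm. When more than one fault propagation condition satisfies the condition, the network device determines, based on the updated subgraph of the communication network at the current time, the current alarm propagation time corresponding to each of them, and determines as the fault source of the currently occurring fault alarm the start point of the one whose current alarm propagation time differs least from its fault propagation time.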
需要说明的是,先根据概率确定满足条件的故障传播条件,再根据故障传播时间确定故障源的过程中的各个步骤的实现过程可以参考步骤801-步骤803中的相关内容,本申请实施例对此不作限定。
不管是先按照故障传播时间再按照概率确定当前发生的故障告警的故障源,还是先按照概率再按照故障传播时间确定当前发生的故障告警的故障源,网络设备在提取出故障传播条件之后,都需要确定故障传播条件出现的概率以及对应的故障传播时间。但是,基于上述步骤404下面的描述可知,网络设备也可以只确定故障传播条件对应的故障传播时间,或者只确定故障传播条件出现的概率。在这种情况下,网络设备可以只根据故障传播时间确定当前发生的故障告警的故障源,或者只根据概率确定当前发生的故障告警的故障源。
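The network device may determine the fault source of the currently occurring fault alarm based on the fault propagation time alone as follows. From the extracted fault propagation conditions, it selects second fault propagation conditions whose end point is the object on which the fault alarm currently occurs and which can be matched against the updated subgraph of the communication network at the current time. Based on the updated subgraph of the communication network at the current time, it selects, from the second fault propagation conditions, third fault propagation conditions whose start point raised a fault alarm before the current time. Based on the updated subgraph of the communication network at the current time, it determines the current alarm propagation time corresponding to each third fault propagation condition. From the third fault propagation conditions, it selects those for which the difference between the current alarm propagation time and the fault propagation time is less than the time threshold, and uses them as the fault propagation conditions that satisfy the condition. When exactly one fault propagation condition satisfies the condition, the network device determines the start point of that fault propagation condition as the fault source of the currently occurring fault alarm. When more than one fault propagation condition satisfies the condition, the network device determines as the fault source of the currently occurring fault alarm the start point of the one whose current alarm propagation time differs least from its fault propagation time.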
其中,网络设备只根据概率确定当前发生的故障告警的故障源的实现过程可以为:从提取的故障传播条件中,选择终点为当前发生故障告警的对象且能够与当前时间通信网络的更新子图匹配的第二故障传播条件。根据当前时间通信网络的更新子图,从第二故障传播条件中选择起点在当前时间之前发生过故障告警的第三故障传播条件。根据当前时间通信网络的更新子图,确定第三故障传播条件对应的本次告警传播时间。从第三故障传播条件中选择概率大于概率阈值的故障传播条件,将选择的故障传播条件作为满足条件的故障传播条件。当满足条件的故障传播条件的数量为1,网络设备将满足条件的故障传播条件的起点确定为当前发生的故障告警的故障源。当满足条件的故障传播条件的数量大于1,网络设备将满足条件的故障传播条件中概率最大的故障传播条件的起点确定为当前发生的故障告警的故障源。
在本申请实施例中,由于故障传播条件是按照多个不同时间一一对应的多个事件对象连接图提取的,因此,可以保证提取出的故障传播条件的准确度,进而可以保证按照提取的故障传播条件确定的故障源的准确度。而且,由于提取出的故障传播条件的故障覆盖率较高,因此,按照提取出的故障传播条件能够确定出故障源的概率也比较大。
请参考图9,图9是本申请实施例示出的一种预测故障传播范围的方法的流程图,该方法是根据当前发生故障告警的对象、当前时间通信网络的更新子图和提取的故障传播条件,预测故障受影响对象。其中,故障受影响对象是指受当前发生的故障告警的影响而发生故障告警的对象。该方法包括如下几个步骤。
步骤901:网络设备从提取的故障传播条件中,选择起点为当前发生故障告警的对象且能够与当前时间通信网络的更新子图匹配的第四故障传播条件。
作为一种示例,网络设备可以从提取的故障传播条件中,选择起点为当前发生故障告警的对象的故障传播条件。从选择的故障传播条件中,筛选出所指示的路径存在于当前时间通信网络的更新子图中的故障传播条件,将筛选出的故障传播条件作为起点为当前发生故障告警的对象且能够与当前时间通信网络的更新子图匹配的第四故障传播条件。
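Based on the description of step 401 above, the communication network may correspond to different event object connection graphs at different times, so the network device can determine the updated subgraph of the communication network at the current time from the event object connection graph of the communication network at the current time. That is, when predicting the fault-affected object, the network device may determine the event object connection graph of the communication network at the current time, determine the subgraph at the current time from that event object connection graph, and, according to the correspondence between objects and object types, update the objects in the subgraph at the current time to the corresponding object types to obtain the updated subgraph of the communication network at the current time.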
步骤902:网络设备将第四故障传播条件的终点确定为故障受影响对象。
步骤903:网络设备根据第四故障传播条件对应的故障传播时间和当前发生的故障告警的告警发生时间,预测故障受影响对象发生故障告警的时间。
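Because the fault propagation time is the difference between the alarm occurrence time of the start point of a fault propagation condition and the alarm occurrence time of its end point, once the fault-affected object has been predicted, the time at which the fault-affected object will raise a fault alarm can be predicted from the fault propagation time corresponding to the fourth fault propagation condition and the alarm occurrence time of the currently occurring fault alarm.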
作为一种示例,可以将当前发生的故障告警的告警发生时间与第四故障传播条件对应的故障传播时间之和,确定为故障受影响对象发生故障告警的时间。
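Steps 901 to 903 can be sketched in the same style. The assumptions match those stated for the fault source sketch: conditions are (path, propagation time in seconds) pairs, the current updated subgraph is a set of undirected type edges, and the function names are hypothetical.

```python
def path_in_graph(path, edges):
    """True if every consecutive pair of object types in the path is an edge."""
    return all(frozenset(pair) in edges for pair in zip(path, path[1:]))

def predict_affected(conditions, alarm_object, alarm_time, edges):
    """Return (affected object type, predicted alarm time) for each matching condition."""
    predictions = []
    for path, propagation_time in conditions:
        # Step 901: start point is the alarming object and the path matches the current graph.
        if path[0] == alarm_object and path_in_graph(path, edges):
            # Steps 902 and 903: the end point, and the alarm time plus the propagation time.
            predictions.append((path[-1], alarm_time + propagation_time))
    return predictions

edges = {frozenset(p) for p in [("OsNetwork", "L3link"), ("L3link", "BGP Peer")]}
conditions = [(("OsNetwork", "L3link", "BGP Peer"), 39.0),
              (("OsRouter", "L3link", "BGP Peer"), 78.0)]
affected = predict_affected(conditions, "OsNetwork", 100.0, edges)
```

With an OsNetwork alarm at t = 100 s, the sketch predicts that BGP Peer will raise an alarm at t = 139 s; the second condition is ignored because its start point is not the alarming object.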
在本申请实施例中,由于故障传播条件是按照多个不同时间一一对应的多个事件对象连接图提取的,因此,可以保证提取出的故障传播条件的准确度,进而可以保证按照提取的故障传播条件预测故障传播范围的准确度。而且,由于提取出的故障传播条件的故障覆盖率较高,因此,按照提取出的故障传播条件能够预测出故障传播范围的概率也比较大。
请参考图10,图10是本申请实施例示出的一种提取故障传播条件的装置的结构示意图,该装置可以由软件、硬件或者两者的结合实现成为网络设备的部分或者全部,该网络设备可以为图1部分内容所描述的网络设备。该装置包括:获取模块1001、第一确定模块1002、更新模块1003和第二确定模块1004。
获取模块1001,用于执行图4所示实施例中的步骤401的操作;
第一确定模块1002,用于执行图4所示实施例中的步骤402的操作;
更新模块1003,用于执行图4所示实施例中的步骤403的操作;
第二确定模块1004,用于执行图4所示实施例中的步骤404的操作。
可选地,第二确定模块1004包括:
转换子模块,用于根据图嵌入算法,将多个更新子图分别转换为图嵌入向量,得到与多个更新子图一一对应的多个图嵌入向量;
第一确定子模块,用于根据多个图嵌入向量和聚类算法确定多个子图集合,多个子图集合中的每个子图集合包括多个更新子图中的至少一个更新子图;
第一提取子模块,用于根据频繁子图挖掘算法,从多个子图集合中的每个子图集合包括的更新子图中提取故障传播条件。
可选地,第一确定子模块用于:
确定多个图嵌入向量中每两个图嵌入向量之间的相似度;
根据相似度和聚类算法对多个更新子图进行聚类,得到多个子图集合。
可选地,第二确定模块1004包括:
第二提取子模块,用于根据频繁子图挖掘算法,从多个更新子图中提取故障传播条件。
可选地,该装置还包括:
第三确定模块,用于确定故障传播条件对应的故障传播时间;
筛选模块,用于根据当前发生故障告警的对象、当前时间通信网络的更新子图和故障传播条件对应的故障传播时间,从故障传播条件中筛选满足条件的故障传播条件;
第四确定模块,用于当满足条件的故障传播条件的数量为1,将满足条件的故障传播条件的起点确定为当前发生的故障告警的故障源。
可选地,第三确定模块包括:
第二确定子模块,用于确定第一故障传播条件的起点的告警发生时间和终点的告警发生时间,第一故障传播条件为第一子图集合中提取出的故障传播条件,多个子图集合包括第一子图集合;
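a third determining submodule, configured to determine the difference between the alarm occurrence time of the start point and the alarm occurrence time of the end point of the first fault propagation condition as the fault propagation time corresponding to the first fault propagation condition.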
可选地,筛选模块包括:
第一选择子模块,用于从故障传播条件中,选择终点为当前发生故障告警的对象且能够与当前时间通信网络的更新子图匹配的第二故障传播条件;
第二选择子模块,用于根据当前时间通信网络的更新子图,从第二故障传播条件中选择起点在当前时间之前发生过故障告警的第三故障传播条件;
第四确定子模块,用于根据当前时间通信网络的更新子图,确定第三故障传播条件对应的本次告警传播时间,本次告警传播时间是指第三故障传播条件的起点的告警发生时间与当前发生的故障告警的告警发生时间之间的差值,第三故障传播条件的起点的告警发生时间是从当前时间通信网络的更新子图中确定的;
第三选择子模块,用于从第三故障传播条件中,选择对应的本次告警传播时间与故障传播时间之间的差值小于时间阈值的故障传播条件,将选择的故障传播条件作为满足条件的故障传播条件。
可选地,该装置还包括:
第五确定模块,用于确定故障传播条件出现的概率;
第六确定模块,用于当满足条件的故障传播条件的数量大于1,将满足条件的故障传播条件中出现的概率最大的故障传播条件的起点确定为当前发生的故障告警的故障源。
可选地,故障传播条件是根据频繁子图挖掘算法,从多个子图集合中的每个子图集合包括的更新子图中提取的;
第五确定模块包括:
第五确定子模块,用于确定第一子图集合中出现第一故障传播条件的更新子图的个数,第一故障传播条件为第一子图集合中提取出的故障传播条件,多个子图集合包括第一子图集合;
第六确定子模块,用于根据个数与第一子图集合中的更新子图的总数之间的比值,确定第一故障传播条件出现的概率。
可选地,第五确定模块包括:
第七确定子模块,用于确定第一故障传播条件在多个更新子图中出现的次数,得到第一次数,提取出的故障传播条件包括第一故障传播条件;
第八确定子模块,用于确定第一故障传播条件的起点与第二事件之间的连接关系在多个更新子图中出现的次数,得到第二次数,与故障相关的事件包括第二事件,且第二事件为第一故障传播条件对应的事件;
第九确定子模块,用于根据第一次数与第二次数之间的比值,确定第一故障传播条件出现的概率。
可选地,该装置还包括:
第一预测模块,用于根据当前发生故障告警的对象、当前时间通信网络的更新子图和故障传播条件,预测故障受影响对象,故障受影响对象是指受当前发生的故障告警的影响而发生故障告警的对象。
可选地,第一预测模块包括:
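a fourth selection submodule, configured to select, from the fault propagation conditions, fourth fault propagation conditions whose start point is the object on which the fault alarm currently occurs and which can be matched against the updated subgraph of the communication network at the current time;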
第七确定子模块,用于将第四故障传播条件的终点确定为故障受影响对象。
可选地,该装置还包括:
第七确定模块,用于确定故障传播条件对应的故障传播时间;
第二预测模块,用于根据第四故障传播条件对应的故障传播时间和当前发生的故障告警的告警发生时间,预测故障受影响对象发生故障告警的时间。
在本申请实施例中,可以通过多个不同时间一一对应的多个事件对象连接图提取故障传播条件,无需人工总结故障传播条件,可以减少人工成本,进而可以提高故障传播条件的提取效率。而且,这多个不同时间内通信网络发生的故障基本可以覆盖所有的故障类型,从而保证提取出的故障传播条件的故障覆盖率较高,且具备可复制性和可扩展性,可以大量推广。
需要说明的是:上述实施例提供的提取故障传播条件的装置在提取故障传播条件时,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将装置的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的提取故障传播条件装置与提取故障传播条件方法实施例属于同一构思,其具体实现过程详见方法实施例,这里不再赘述。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意结合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络或其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如:同轴电缆、光纤、数据用户线(digital subscriber line,DSL))或无线(例如:红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存取的任何可用介质,或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质(例如:软盘、硬盘、磁带)、光介质(例如:数字通用光盘(digital versatile disc,DVD))或半导体介质(例如:固态硬盘(solid state disk,SSD))等。值得注意的是,本申请提到的计算机可读存储介质可以为非易失性存储介质,换句话说,可以是非瞬时性存储介质。
应当理解的是,本文提及的“多个”是指两个或两个以上。在本申请的描述中,除非另有说明,“/”表示或的意思,例如,A/B可以表示A或B;本文中的“和/或”仅仅是一种描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。另外,为了便于清楚描述本申请实施例的技术方案,在本申请的实施例中,采用了“第一”、“第二”等字样对功能和作用基本相同的相同项或相似项进行区分。本领域技术人员可以理解“第一”、“第二”等字样并不对数量和执行次序进行限定,并且“第一”、“第二”等字样也并不限定一定不同。
以上所述为本申请提供的实施例,并不用以限制本申请,凡在本申请的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本申请的保护范围之内。

Claims (28)

  1. 一种提取故障传播条件的方法,其特征在于,所述方法包括:
    网络设备在多个不同时间获取通信网络对应的多个事件对象连接图,所述多个不同时间与所述多个事件对象连接图一一对应,所述多个事件对象连接图中的每个事件对象连接图用于描述所述通信网络中发生的、与故障相关的事件,以及与所述事件相关的对象之间的连接关系;
    所述网络设备根据所述多个事件对象连接图确定多个子图,所述多个子图与所述多个事件对象连接图一一对应,所述多个子图中的每个子图是对应的事件对象连接图的子集,所述多个子图中的每个子图中产生第一事件的对象与所述第一事件相关的任意对象之间的跳数不大于N,所述事件包括所述第一事件,所述N为大于或等于1的整数;
    所述网络设备根据对象与对象类型之间的对应关系,将所述多个子图中的每个子图中的对象更新为对应的对象类型,得到多个更新子图,所述多个更新子图与所述多个子图一一对应;
    所述网络设备根据所述多个更新子图确定故障传播条件,所述故障传播条件用于指示故障在所述通信网络中被传播的路径。
  2. 如权利要求1所述的方法,其特征在于,所述网络设备根据所述多个更新子图确定故障传播条件,包括:
    所述网络设备根据图嵌入算法,将所述多个更新子图分别转换为图嵌入向量,得到与所述多个更新子图一一对应的多个图嵌入向量;
    所述网络设备根据所述多个图嵌入向量和聚类算法确定多个子图集合,所述多个子图集合中的每个子图集合包括所述多个更新子图中的至少一个更新子图;
    所述网络设备根据频繁子图挖掘算法,从所述多个子图集合中的每个子图集合包括的更新子图中提取所述故障传播条件。
  3. 如权利要求2所述的方法,其特征在于,所述网络设备根据所述多个图嵌入向量和聚类算法确定多个子图集合,包括:
    所述网络设备确定所述多个图嵌入向量中每两个图嵌入向量之间的相似度;
    所述网络设备根据所述相似度和所述聚类算法对所述多个更新子图进行聚类,得到所述多个子图集合。
  4. 如权利要求1所述的方法,其特征在于,所述网络设备根据所述多个更新子图确定故障传播条件,包括:
    所述网络设备根据频繁子图挖掘算法,从所述多个更新子图中提取所述故障传播条件。
  5. 如权利要求2-4任一所述的方法,其特征在于,在所述网络设备根据所述多个更新子图确定故障传播条件之后,所述方法还包括:
    所述网络设备确定所述故障传播条件对应的故障传播时间;
    所述方法还包括:
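    the network device filters, from the fault propagation conditions, the fault propagation conditions that satisfy the condition according to the object on which the fault alarm currently occurs, the updated subgraph of the communication network at the current time, and the fault propagation time corresponding to the fault propagation conditions;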
    当所述满足条件的故障传播条件的数量为1,所述网络设备将所述满足条件的故障传播条件的起点确定为当前发生的故障告警的故障源。
  6. 如权利要求5所述的方法,其特征在于,所述网络设备确定所述故障传播条件对应的故障传播时间,包括:
    所述网络设备确定第一故障传播条件的起点的告警发生时间和终点的告警发生时间,所述第一故障传播条件为第一子图集合中提取出的故障传播条件,所述多个子图集合包括所述第一子图集合;
    所述网络设备将所述第一故障传播条件的起点的告警发生时间与所述终点的告警发生时间之间的差值,确定为所述第一故障传播条件对应的故障传播时间。
  7. 如权利要求5或6所述的方法,其特征在于,所述网络设备根据当前发生故障告警的对象、当前时间所述通信网络的更新子图和所述故障传播条件对应的故障传播时间,从所述故障传播条件中筛选满足条件的故障传播条件,包括:
    所述网络设备从所述故障传播条件中,选择终点为当前发生故障告警的对象且能够与当前时间所述通信网络的更新子图匹配的第二故障传播条件;
    所述网络设备根据当前时间所述通信网络的更新子图,从所述第二故障传播条件中选择起点在当前时间之前发生过故障告警的第三故障传播条件;
    所述网络设备根据当前时间所述通信网络的更新子图,确定所述第三故障传播条件对应的本次告警传播时间,所述本次告警传播时间是指所述第三故障传播条件的起点的告警发生时间与当前发生的故障告警的告警发生时间之间的差值,所述第三故障传播条件的起点的告警发生时间是从当前时间所述通信网络的更新子图中确定的;
    所述网络设备从所述第三故障传播条件中,选择对应的本次告警传播时间与故障传播时间之间的差值小于时间阈值的故障传播条件,将选择的故障传播条件作为满足条件的故障传播条件。
  8. 如权利要求5所述的方法,其特征在于,所述方法还包括:
    所述网络设备确定所述故障传播条件出现的概率;
    当所述满足条件的故障传播条件的数量大于1,所述网络设备将所述满足条件的故障传播条件中出现的概率最大的故障传播条件的起点确定为当前发生的故障告警的故障源。
  9. 如权利要求8所述的方法,其特征在于,所述故障传播条件是所述网络设备根据频繁子图挖掘算法,从多个子图集合中的每个子图集合包括的更新子图中提取的;
    所述网络设备确定所述故障传播条件出现的概率,包括:
    所述网络设备确定第一子图集合中出现第一故障传播条件的更新子图的个数,所述第一故障传播条件为所述第一子图集合中提取出的故障传播条件,所述多个子图集合包括所述第一子图集合;
    所述网络设备根据所述个数与所述第一子图集合中的更新子图的总数之间的比值,确定所述第一故障传播条件出现的概率。
  10. 如权利要求8所述的方法,其特征在于,所述网络设备确定所述故障传播条件出现的概率,包括:
    所述网络设备确定第一故障传播条件在所述多个更新子图中出现的次数,得到第一次数,所述故障传播条件包括所述第一故障传播条件;
    所述网络设备确定所述第一故障传播条件的起点与第二事件之间的连接关系在所述多个更新子图中出现的次数,得到第二次数,所述事件包括所述第二事件,且所述第二事件为所述第一故障传播条件对应的事件;
    所述网络设备根据所述第一次数与所述第二次数之间的比值,确定所述第一故障传播条件出现的概率。
  11. 如权利要求1-10任一所述的方法,其特征在于,在所述网络设备根据所述多个更新子图确定故障传播条件之后,所述方法还包括:
    所述网络设备根据当前发生故障告警的对象、当前时间所述通信网络的更新子图和所述故障传播条件,预测故障受影响对象,所述故障受影响对象是指受当前发生的故障告警的影响而发生故障告警的对象。
  12. 如权利要求11所述的方法,其特征在于,所述网络设备根据当前发生故障告警的对象和所述故障传播条件,预测故障受影响对象,包括:
    所述网络设备从所述故障传播条件中,选择起点为当前发生故障告警的对象且能够与当前时间所述通信网络的更新子图匹配的第四故障传播条件;
    所述网络设备将所述第四故障传播条件的终点确定为所述故障受影响对象。
  13. 如权利要求12所述的方法,其特征在于,所述方法还包括:
    所述网络设备确定所述故障传播条件对应的故障传播时间;
    所述网络设备根据所述第四故障传播条件对应的故障传播时间和当前发生的故障告警的告警发生时间,预测所述故障受影响对象发生故障告警的时间。
  14. 一种提取故障传播条件的装置,其特征在于,所述装置包括:
    获取模块,用于在多个不同时间获取通信网络对应的多个事件对象连接图,所述多个不同时间与所述多个事件对象连接图一一对应,所述多个事件对象连接图中的每个事件对象连接图用于描述所述通信网络中发生的、与故障相关的事件,以及与所述事件相关的对象之间的连接关系;
    第一确定模块,用于根据所述多个事件对象连接图确定多个子图,所述多个子图与所述多个事件对象连接图一一对应,所述多个子图中的每个子图是对应的事件对象连接图的子集,所述多个子图中的每个子图中产生第一事件的对象与所述第一事件相关的任意对象之间的跳数不大于N,所述事件包括所述第一事件,所述N为大于或等于1的整数;
    更新模块,用于根据对象与对象类型之间的对应关系,将所述多个子图中的每个子图中的对象更新为对应的对象类型,得到多个更新子图,所述多个更新子图与所述多个子图一一对应;
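    a second determining module, configured to determine a fault propagation condition according to the multiple updated subgraphs, where the fault propagation condition indicates a path along which a fault is propagated in the communication network.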
  15. 如权利要求14所述的装置,其特征在于,所述第二确定模块包括:
    转换子模块,用于根据图嵌入算法,将所述多个更新子图分别转换为图嵌入向量,得到与所述多个更新子图一一对应的多个图嵌入向量;
    第一确定子模块,用于根据所述多个图嵌入向量和聚类算法确定多个子图集合,所述多个子图集合中的每个子图集合包括所述多个更新子图中的至少一个更新子图;
    第一提取子模块,用于根据频繁子图挖掘算法,从所述多个子图集合中的每个子图集合包括的更新子图中提取所述故障传播条件。
  16. 如权利要求15所述的装置,其特征在于,所述第一确定子模块用于:
    确定所述多个图嵌入向量中每两个图嵌入向量之间的相似度;
    根据所述相似度和所述聚类算法对所述多个更新子图进行聚类,得到所述多个子图集合。
  17. 如权利要求14所述的装置,其特征在于,所述第二确定模块包括:
    第二提取子模块,用于根据频繁子图挖掘算法,从所述多个更新子图中提取所述故障传播条件。
  18. 如权利要求15-17任一所述的装置,其特征在于,所述装置还包括:
    第三确定模块,用于确定所述故障传播条件对应的故障传播时间;
    筛选模块,用于根据当前发生故障告警的对象、当前时间所述通信网络的更新子图和所述故障传播条件对应的故障传播时间,从所述故障传播条件中筛选满足条件的故障传播条件;
    第四确定模块,用于当所述满足条件的故障传播条件的数量为1,将所述满足条件的故障传播条件的起点确定为当前发生的故障告警的故障源。
  19. 如权利要求18所述的装置,其特征在于,所述第三确定模块包括:
    第二确定子模块,用于确定第一故障传播条件的起点的告警发生时间和终点的告警发生时间,所述第一故障传播条件为第一子图集合中提取出的故障传播条件,所述多个子图集合包括所述第一子图集合;
    第三确定子模块,用于将所述第一故障传播条件的起点的告警发生时间与所述终点的告警发生时间之间的差值,确定为所述第一故障传播条件对应的故障传播时间。
  20. 如权利要求18或19所述的装置,其特征在于,所述筛选模块包括:
    第一选择子模块,用于从所述故障传播条件中,选择终点为当前发生故障告警的对象且能够与当前时间所述通信网络的更新子图匹配的第二故障传播条件;
    第二选择子模块,用于根据当前时间所述通信网络的更新子图,从所述第二故障传播条件中选择起点在当前时间之前发生过故障告警的第三故障传播条件;
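    a fourth determining submodule, configured to determine, according to the updated subgraph of the communication network at the current time, the current alarm propagation time corresponding to the third fault propagation condition, where the current alarm propagation time is the difference between the alarm occurrence time of the start point of the third fault propagation condition and the alarm occurrence time of the currently occurring fault alarm, and the alarm occurrence time of the start point of the third fault propagation condition is determined from the updated subgraph of the communication network at the current time;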
    第三选择子模块,用于从所述第三故障传播条件中,选择对应的本次告警传播时间与故障传播时间之间的差值小于时间阈值的故障传播条件,将选择的故障传播条件作为满足条件的故障传播条件。
  21. 如权利要求18所述的装置,其特征在于,所述装置还包括:
    第五确定模块,用于确定所述故障传播条件出现的概率;
    第六确定模块,用于当所述满足条件的故障传播条件的数量大于1,将所述满足条件的故障传播条件中出现的概率最大的故障传播条件的起点确定为当前发生的故障告警的故障源。
  22. 如权利要求21所述的装置,其特征在于,所述故障传播条件是根据频繁子图挖掘算法,从多个子图集合中的每个子图集合包括的更新子图中提取的;
    所述第五确定模块包括:
    第五确定子模块,用于确定第一子图集合中出现第一故障传播条件的更新子图的个数,所述第一故障传播条件为所述第一子图集合中提取出的故障传播条件,所述多个子图集合包括所述第一子图集合;
    第六确定子模块,用于根据所述个数与所述第一子图集合中的更新子图的总数之间的比值,确定所述第一故障传播条件出现的概率。
  23. 如权利要求21所述的装置,其特征在于,所述第五确定模块包括:
    第七确定子模块,用于确定第一故障传播条件在所述多个更新子图中出现的次数,得到第一次数,所述故障传播条件包括所述第一故障传播条件;
    第八确定子模块,用于确定所述第一故障传播条件的起点与第二事件之间的连接关系在所述多个更新子图中出现的次数,得到第二次数,所述事件包括所述第二事件,且所述第二事件为所述第一故障传播条件对应的事件;
    第九确定子模块,用于根据所述第一次数与所述第二次数之间的比值,确定所述第一故障传播条件出现的概率。
  24. 如权利要求14-23任一所述的装置,其特征在于,所述装置还包括:
    第一预测模块,用于根据当前发生故障告警的对象、当前时间所述通信网络的更新子图和所述故障传播条件,预测故障受影响对象,所述故障受影响对象是指受当前发生的故障告警的影响而发生故障告警的对象。
  25. 如权利要求24所述的装置,其特征在于,所述第一预测模块包括:
    第四选择子模块,用于所述网络设备从所述故障传播条件中,选择起点为当前发生故障告警的对象且能够与当前时间所述通信网络的更新子图匹配的第四故障传播条件;
    第七确定子模块,用于将所述第四故障传播条件的终点确定为所述故障受影响对象。
  26. 如权利要求25所述的装置,其特征在于,所述装置还包括:
    第七确定模块,用于确定所述故障传播条件对应的故障传播时间;
    第二预测模块,用于根据所述第四故障传播条件对应的故障传播时间和当前发生的故障告警的告警发生时间,预测所述故障受影响对象发生故障告警的时间。
  27. 一种网络设备,其特征在于,所述网络设备包括处理器和网络接口,所述网络接口用于获取实现权利要求1-13任一所述的方法所涉及的数据,所述处理器用于根据所述网络接口获取的数据,执行权利要求1-13任一所述的方法的步骤。
  28. 一种计算机可读存储介质,其特征在于,所述存储介质内存储有计算机程序,所述计算机程序被处理器执行时实现权利要求1-13任一所述的方法的步骤。
PCT/CN2020/115701 2019-09-17 2020-09-16 提取故障传播条件的方法、装置及存储介质 WO2021052380A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP20865768.4A EP4024765B1 (en) 2019-09-17 2020-09-16 Method and apparatus for extracting fault propagation condition, and storage medium
US17/655,107 US20220207383A1 (en) 2019-09-17 2022-03-16 Fault propagation condition extraction method and apparatus and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910877916.8A CN112532408B (zh) 2019-09-17 2019-09-17 提取故障传播条件的方法、装置及存储介质
CN201910877916.8 2019-09-17

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/655,107 Continuation US20220207383A1 (en) 2019-09-17 2022-03-16 Fault propagation condition extraction method and apparatus and storage medium

Publications (1)

Publication Number Publication Date
WO2021052380A1 true WO2021052380A1 (zh) 2021-03-25

Family

ID=74884529

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/115701 WO2021052380A1 (zh) 2019-09-17 2020-09-16 提取故障传播条件的方法、装置及存储介质

Country Status (4)

Country Link
US (1) US20220207383A1 (zh)
EP (1) EP4024765B1 (zh)
CN (1) CN112532408B (zh)
WO (1) WO2021052380A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113434326A (zh) * 2021-07-12 2021-09-24 国泰君安证券股份有限公司 基于分布式集群拓扑实现网络系统故障定位的方法及装置、处理器及其计算机可读存储介质
CN113434326B (zh) * 2021-07-12 2024-05-31 国泰君安证券股份有限公司 基于分布式集群拓扑实现网络系统故障定位的方法及装置、处理器及其计算机可读存储介质

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115277357A (zh) * 2021-04-30 2022-11-01 华为技术有限公司 网络故障分析方法、装置、设备及存储介质
US20220394546A1 (en) * 2021-06-07 2022-12-08 At&T Intellectual Property I, L.P. Apparatuses and methods for identifying impacts on quality of service based on relationships between communication nodes
US20230095270A1 (en) * 2021-09-24 2023-03-30 Bmc Software, Inc. Probabilistic root cause analysis

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104794013A (zh) * 2015-03-20 2015-07-22 百度在线网络技术(北京)有限公司 定位系统运行状态、建立系统运行状态模型的方法及装置
US20180103052A1 (en) * 2016-10-11 2018-04-12 Battelle Memorial Institute System and methods for automated detection, reasoning and recommendations for resilient cyber systems
CN108320040A (zh) * 2017-01-17 2018-07-24 国网重庆市电力公司 基于贝叶斯网络优化算法的采集终端故障预测方法及系统
CN108964960A (zh) * 2017-05-27 2018-12-07 阿里巴巴集团控股有限公司 一种告警事件的处理方法及装置
CN109684181A (zh) * 2018-11-20 2019-04-26 华为技术有限公司 告警根因分析方法、装置、设备及存储介质

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10474520B2 (en) * 2013-04-29 2019-11-12 Moogsoft, Inc. Methods for decomposing events from managed infrastructures
CN107666468B (zh) * 2016-07-29 2020-08-04 中国电信股份有限公司 网络安全检测方法和装置
CN108322320B (zh) * 2017-01-18 2020-04-28 华为技术有限公司 业务生存性分析方法及装置
CN108073946A (zh) * 2017-11-29 2018-05-25 东北大学 一种面向图数据的投影聚类方法
CN108762908B (zh) * 2018-05-31 2021-12-07 创新先进技术有限公司 系统调用异常检测方法及装置

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4024765A4

Also Published As

Publication number Publication date
EP4024765A4 (en) 2022-10-19
EP4024765B1 (en) 2023-11-22
CN112532408A (zh) 2021-03-19
EP4024765A1 (en) 2022-07-06
CN112532408B (zh) 2022-05-24
US20220207383A1 (en) 2022-06-30

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20865768

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2020865768

Country of ref document: EP

Effective date: 20220331