WO2021159676A1 - Data processing method and related device - Google Patents

Data processing method and related device

Info

Publication number
WO2021159676A1
Authority
WO
WIPO (PCT)
Prior art keywords
propagation path
analysis device
result
node
target
Prior art date
Application number
PCT/CN2020/108424
Other languages
English (en)
French (fr)
Inventor
肖欣
谢于明
王仲宇
高云鹏
宋伟
Original Assignee
华为技术有限公司
Application filed by 华为技术有限公司
Priority to EP20919283.0A (EP4084411A4)
Publication of WO2021159676A1
Priority to US17/875,809 (US20220376971A1)

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0677 Localisation of faults
    • H04L41/069 Management of faults, events, alarms or notifications using logs of notifications; Post-processing of notifications
    • H04L41/14 Network analysis or design

Definitions

  • the embodiments of the present application relate to the field of communication technology, and in particular, to a data processing method and related equipment.
  • A network failure is a state in which the network cannot provide normal services, or provides degraded service quality, because of hardware problems, software vulnerabilities, virus intrusion, or similar causes.
  • address resolution protocol (ARP)
  • router identity (ID)
  • the newly generated network data and all historical network data are processed to obtain the fault propagation path, which indicates the path through which the fault is propagated in the network.
  • The embodiments of the present application provide a data processing method and related equipment that are more efficient than processing all network data. Compared with storing all historical network data, only the historical fault propagation paths are stored, which reduces storage cost and makes the approach replicable and scalable.
  • The first aspect of the embodiments of the present application provides a data processing method, including: an analysis device obtains first network data, where the first network data includes information about abnormal events of multiple nodes in the network during a first time period and the connection relationships among the multiple nodes; the analysis device processes the first network data to obtain a first fault propagation path, where the first fault propagation path indicates that a first abnormal event occurring at a first node in the first time period causes a second abnormal event to occur at a second node, and the first node and the second node are any two different nodes among the multiple nodes; the analysis device obtains historical fault propagation paths; the analysis device determines whether the historical fault propagation paths include a target fault propagation path that is the same as the first fault propagation path, where the target fault propagation path indicates that a third abnormal event that occurred at the first node before the first time period caused a fourth abnormal event to occur at the second node, the third abnormal event and the first abnormal event are of the same event type, and the fourth abnormal event and the second abnormal event are of the same event type; when the historical fault propagation paths include the target fault propagation path, the analysis device updates the number of occurrences of the target fault propagation path.
  • In the embodiments of the present application, the analysis device acquires the first network data and the historical fault propagation paths, and processes the first network data to obtain the first fault propagation path. When it determines that the historical fault propagation paths include a path identical to the first fault propagation path, it updates the number of occurrences of that path. This is more efficient than processing all network data. Compared with storing all historical network data, only the historical fault propagation paths are stored, which reduces storage cost and makes the approach replicable and scalable.
  • the nodes passed by the target fault propagation path are the same as the nodes passed by the first fault propagation path.
  • When the nodes passed by the target fault propagation path are the same as the nodes passed by the first fault propagation path, the target fault propagation path is determined to be the same as the first fault propagation path, which makes subsequent fault location and troubleshooting more detailed and accurate.
  • The analysis device can process the first network data to obtain a first result.
  • The first result includes the first fault propagation path and a first duration.
  • The first duration is obtained by processing the first time interval, within the first time period, between the occurrence time of the first abnormal event at the first node of the first fault propagation path and the occurrence time of the second abnormal event at the second node of the first fault propagation path.
  • The analysis device can obtain historical results.
  • The historical results include a second result.
  • The second result includes the target fault propagation path and a second duration corresponding to the target fault propagation path.
  • The second duration is obtained by processing the second time interval between the occurrence time of the third abnormal event and the occurrence time of the fourth abnormal event.
  • the analysis device calculates the first result and the second result to obtain the target duration.
  • the analysis device updates the second duration to the target duration.
  • the time length of the fault propagation path is updated through an incremental update method, which provides a reference for the subsequent prediction of the impact time length of the fault.
  • The analysis device uses the larger of the first duration and the second duration as the target duration.
  • Specifying the target duration as the maximum duration improves the feasibility of the solution.
  • The first result may also include a first number, which is the number of occurrences of the first fault propagation path in the first time period; the second result may also include a second number, which is the number of occurrences of the target fault propagation path before the first time period.
  • the analysis device calculates the target duration in the following way:
  • Specifying a calculation method for the target duration improves the feasibility of the solution.
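The text above names one concrete rule for the target duration (the larger of the two durations); the formula announced by "in the following way" is not reproduced in this text. The sketch below shows the maximum rule and, purely as an assumption, a variant that averages the two durations weighted by the first and second numbers. The function names are hypothetical.

```python
def target_duration_max(first_duration: float, second_duration: float) -> float:
    """Target duration as the larger of the new and the historical duration."""
    return max(first_duration, second_duration)


def target_duration_weighted(first_duration: float, first_count: int,
                             second_duration: float, second_count: int) -> float:
    """ASSUMED variant: average of both durations, weighted by occurrence counts."""
    total = first_count + second_count
    if total == 0:
        raise ValueError("the path has no recorded occurrences")
    return (first_duration * first_count + second_duration * second_count) / total


# Durations in seconds; counts are occurrences of the path.
print(target_duration_max(94.0, 80.0))
print(target_duration_weighted(94.0, 1, 80.0, 3))
```

Either rule needs only the stored second result and the new first result, so no historical network data has to be reprocessed.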
  • The first result further includes a third number, which is the number of abnormal events of the same event type as the second abnormal event that occurred at the second node during the first time period; the second result further includes a fourth number, which is the number of abnormal events of the same event type as the fourth abnormal event that occurred at the second node before the first time period.
  • the analysis device calculates the first result and the second result to obtain the target probability.
  • the analysis device updates the probability of the target failure propagation path to the target probability.
  • the probability of the target failure propagation path is updated to the target probability through an incremental update method, which is beneficial to improve the accuracy of subsequent failure root cause determination.
  • The analysis device processes the first network data to obtain the first result.
  • The first result includes the first fault propagation path and the third number.
  • The third number is the number of abnormal events of the same event type as the second abnormal event that occurred at the second node during the first time period.
  • The analysis device obtains the historical results, where the historical results include the second result, the second result includes the target fault propagation path and the fourth number, and the fourth number is the number of abnormal events of the same event type as the fourth abnormal event that occurred at the second node before the first time period.
  • the analysis device calculates the first result and the second result to obtain the target probability.
  • the analysis device updates the probability of the target failure propagation path to the target probability.
  • the probability of the target failure propagation path is updated to the target probability through an incremental update method, which is beneficial to improve the accuracy of subsequent root cause determination of the failure.
  • the analysis device calculates the target probability by the following method:
  • Specifying a calculation method for the target probability improves the feasibility of the solution.
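The formula for the target probability is not reproduced in this text. As an illustration only, the sketch below assumes that the target probability is the combined number of path occurrences (first number plus second number) divided by the combined number of end-node events (third number plus fourth number). This ratio is an assumption, not a statement of the claimed formula, and the function name is hypothetical.

```python
def target_probability(first_count: int, second_count: int,
                       third_count: int, fourth_count: int) -> float:
    """ASSUMED incremental update: path occurrences over end-node event occurrences."""
    path_total = first_count + second_count    # new + historical path occurrences
    event_total = third_count + fourth_count   # new + historical end-node events
    if event_total == 0:
        return 0.0
    return path_total / event_total


# 2 new and 6 historical path occurrences; 4 new and 12 historical end-node events.
print(target_probability(2, 6, 4, 12))
```

Because only four counters are combined, the update cost does not grow with the amount of historical network data.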
  • the analysis device saves the first failure propagation path.
  • the first fault propagation path is saved to provide a new reference for subsequent troubleshooting.
  • The analysis device processes the first network data to obtain a first result.
  • The first result includes a first fault propagation path and a first duration.
  • The first duration is obtained by processing the first time interval, within the first time period, between the occurrence time of the alarm corresponding to the first fault at the first node of the first fault propagation path and the occurrence time of the alarm corresponding to the second fault at the second node of the first fault propagation path.
  • the analysis device saves the first length of time.
  • the first duration is saved, which provides a new reference for subsequent troubleshooting.
  • The first result includes the first number and the third number.
  • The first number is the number of occurrences of the first fault propagation path in the first time period.
  • The third number is the number of abnormal events of the same event type as the second abnormal event that occurred at the second node during the first time period.
  • the analysis device saves the first probability of the propagation path of the first fault, and the first probability is the probability that the second abnormal event at the second node is caused by the first abnormal event of the first node.
  • the first probability is saved, which provides a new reference for subsequent troubleshooting.
  • the analysis device processes the first network data based on the frequent subgraph mining algorithm to obtain the first fault propagation path.
  • a method for processing the first network data is defined, which improves the feasibility of the solution.
  • Based on the first aspect of the embodiments of the present application, or any one of the first to eleventh implementation manners of the first aspect, in a twelfth implementation manner of the first aspect, when the first fault propagation path has multiple first time intervals, the first duration is the maximum value or the average value of the multiple first time intervals.
  • Specifying a method for processing multiple first time intervals improves the feasibility of the solution.
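A minimal sketch of reducing multiple first time intervals to a single first duration, using either the maximum or the average as described above. The function name and the `mode` parameter are hypothetical.

```python
from statistics import mean


def first_duration(intervals, mode="max"):
    """Reduce several observed intervals (in seconds) to one duration."""
    if not intervals:
        raise ValueError("at least one interval is required")
    if mode == "max":
        return max(intervals)
    if mode == "mean":
        return mean(intervals)
    raise ValueError(f"unknown mode: {mode}")


observed = [94, 60, 80]  # hypothetical intervals for one path, in seconds
print(first_duration(observed, "max"))
print(first_duration(observed, "mean"))
```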
  • the analysis device sends the target result or the target failure propagation path to the cloud device.
  • A second aspect of the embodiments of the present application provides a data processing method, including: a collection device sends first network data to an analysis device, so that the analysis device processes the first network data to obtain a first fault propagation path, where the first network data includes abnormal event information and connection relationships.
  • a third aspect of the embodiments of the present application provides a data processing method, including: a cloud device receives a target result sent by an analysis device, and the target result includes at least one of a target failure propagation path, a target duration, and a target probability.
  • the fourth aspect of the embodiments of the present application provides an analysis device that executes the method of the foregoing first aspect.
  • a fifth aspect of the embodiments of the present application provides a collection device that executes the method of the aforementioned second aspect.
  • the sixth aspect of the embodiments of the present application provides a cloud device that executes the method of the foregoing third aspect.
  • A seventh aspect of the embodiments of the present application provides a computer storage medium that stores instructions.
  • When the instructions are executed on a computer, the computer executes the method of the first aspect described above.
  • An eighth aspect of the embodiments of the present application provides a computer program product.
  • When the computer program product is executed on a computer, the computer executes the method of the aforementioned first aspect.
  • FIG. 1 is a schematic diagram of a network architecture in an embodiment of this application.
  • FIG. 2 is a schematic flow chart of a data processing method in an embodiment of the application.
  • FIG. 3 is a schematic diagram of an event node connection diagram in an embodiment of the application.
  • FIG. 4 is a schematic diagram of a fault propagation path in an embodiment of this application.
  • FIG. 5 is a schematic diagram of another fault propagation path in an embodiment of this application.
  • FIG. 6 is a schematic diagram of another flow of a data processing method in an embodiment of the application.
  • FIG. 7 is a schematic diagram of a structure of an analysis device in an embodiment of the application.
  • FIG. 8 is a schematic diagram of another structure of an analysis device in an embodiment of the application.
  • FIG. 9 is a schematic diagram of another structure of the analysis device in an embodiment of the application.
  • The embodiments of the present application provide a data processing method and related equipment that are more efficient than processing all network data. Compared with storing all historical network data, only the historical fault propagation paths are stored, which reduces storage cost and makes the approach replicable and scalable.
  • the method provided in the embodiments of the present application can be applied to various communication networks, such as a data center network (DCN), a mobile communication network, and the like.
  • The equipment in these communication networks can be connected to the analysis device, and the analysis device can be used to update or add fault propagation paths that help locate faults in these communication networks. That is, the analysis device used to update or add fault propagation paths may be a device independent of the communication network.
  • The analysis device used to update or add fault propagation paths can also be a device in the communication network. That is, a device in the communication network can itself update or add fault propagation paths that help locate faults in that network.
  • Fig. 1 is a schematic diagram of a network architecture in an embodiment of the application.
  • the network architecture in the embodiment of the present application includes: a collection device 101, an analysis device 102, and a cloud device 103.
  • A communication connection is established between one collection device 101 and one analysis device 102.
  • One collection device 101 can also establish communication connections with two or more analysis devices 102, and one analysis device 102 can also establish communication connections with two or more collection devices 101.
  • The collection device 101, the analysis device 102, and the cloud device 103 may be connected through a wired network or a wireless network. A wired connection is generally an optical fiber network; a wireless connection is generally a wireless fidelity (WiFi) network, a cellular wireless network, or another type of wireless network.
  • The main function of the collection device 101 is to collect network data, such as fault data and abnormal data, in the communication network and to provide the network data to the analysis device 102.
  • The main function of the analysis device 102 is to extract, update, and add fault propagation path information and to provide the fault propagation path information to the cloud device 103.
  • The cloud device 103 may be integrated into an operation support system (OSS) to present the summarized and updated fault propagation path results.
  • the analysis device 102 may be a server, or a server cluster composed of several servers, or a cloud computing service center.
  • the cloud device 103 may be a computer, or a server, or a server cluster composed of several servers, or a cloud computing service center, which is deployed at the back end of the service network.
  • If the collection device 101 integrates the function of updating or adding fault propagation paths, the collection device 101 can be connected directly to the cloud device 103, without the analysis device 102 updating or adding the fault propagation paths.
  • an embodiment of the data processing method in the embodiment of the present application includes:
  • 201. The analysis device obtains first network data.
  • the analysis device may obtain the first network data through a network device, or may obtain the first network data through manual input by an operation and maintenance personnel, and the specific acquisition method is not limited here.
  • the network device may be a network device with a collection function, such as a router and a switch.
  • the first network data in the embodiment of the present application is the abnormal information of each node in the first time period and the relationship between each node in the communication network.
  • the abnormal information may be at least one of fault information, alarms, logs, network performance indicators (key performance indicators, KPIs), or other event information.
  • The nodes related to the abnormal event may be physical nodes such as physical devices, boards, and physical ports, or may be protocol-related nodes such as those of open shortest path first (OSPF), border gateway protocol (BGP), rapid ring protection protocol (RRPP), and virtual local area network (VLAN).
  • Table 1
      Event type                                                          Representation of the event type
      The state of the interface has changed                              IF_STATE
      Interface is deleted                                                IF_DELETE
      Neighbor status changes                                             NBR_CHANGE_E
      The port on the RRPP ring enters the forwarding state               PFWD
      The OSPF neighbor setting interface configuration is inconsistent   ospfIfConfigError
      OSPF neighbor status changes                                        ospfNbrStateChange_active
      Interface status change                                             linkDown_active
      The VXLAN tunnel status changes to Down                             hwNvo3VxlanTnlDown
  • Table 1 is only an example of event types and representation forms. In actual applications, there are other event types or other representation forms, which are not specifically limited here.
  • 202. The analysis device processes the first network data to obtain a first result.
  • In the embodiment of this application, the frequent subgraph mining algorithm is used only as an example of how the analysis device processes the first network data to obtain the first result. In practical applications, other methods are possible, such as graph embedding and clustering techniques; the specific method is not limited here.
  • the frequent subgraph mining algorithm in the embodiment of the present application may be algorithms such as gSpan, CloseGraph, etc., which is not specifically limited here.
  • After the analysis device obtains the first network data, it extracts the abnormal events related to the fault, and the nodes related to those abnormal events, from the abnormal information. It then generates an event node connection graph according to the relationships between the extracted abnormal events and their related nodes. Figure 3 shows the connection relationships of the nodes and the abnormal events occurring at each node; Figure 3 is only an example of an event node connection graph.
  • the event node connection graph may be represented in a graphical form, or may be represented in other forms, for example, may be represented in the form of table items, which is not specifically limited here.
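As a sketch of the table-entry form mentioned above, the following builds one row per node holding its abnormal event types and its connected neighbors. The node names, the sample data, and the function name are hypothetical; the event type strings are taken from Table 1.

```python
from collections import defaultdict


def build_event_node_table(abnormal_events, connections):
    """Return rows of (node, event types at the node, sorted neighbor list)."""
    events = defaultdict(list)
    neighbors = defaultdict(set)
    for node, event_type in abnormal_events:
        events[node].append(event_type)
    for a, b in connections:
        neighbors[a].add(b)
        neighbors[b].add(a)
    all_nodes = sorted(set(events) | set(neighbors))
    return [(n, events.get(n, []), sorted(neighbors.get(n, ()))) for n in all_nodes]


# Hypothetical input: (node, event type) pairs and node-to-node links.
abnormal_events = [("OsRouter-1", "ospfNbrStateChange_active"),
                   ("OsNetwork-1", "NBR_CHANGE_E")]
connections = [("OsRouter-1", "OsNetwork-1"), ("OsNetwork-1", "Switch-7")]

table = build_event_node_table(abnormal_events, connections)
for row in table:
    print(row)
```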
  • The analysis device uses the frequent subgraph mining algorithm to extract the common propagation paths from the event node connection graphs of all faults.
  • A common propagation path is a fault propagation path, which indicates that an abnormal event at one node causes another abnormal event to occur.
  • The fault propagation path is extracted from a plurality of event node connection graphs similar to Figure 3. One form of the fault propagation path is shown in FIG. 4: abnormal event 101 at node 1 causes abnormal event 102 at node 2.
  • The path connecting node 1 and node 2 does not contain any other node that has a fault alarm; that is, there is no fault-related event on the path between the two nodes. The two nodes are directly connected, which is equivalent to a hop count of 1 between them. It can be understood that FIG. 4 is only an exemplary illustration, and the number of hops between the two nodes may also be an integer greater than 1, which is not specifically limited here.
  • the form of the fault propagation path in the embodiment of the present application may be a visual graphic form, a text form, or other types, which are not specifically limited here.
  • Node 1 in the fault propagation path is an OSPF router (OsRouter), and node 2 is an OSPF network segment (OsNetwork). That is, abnormal event 101 at the OsRouter causes abnormal event 102 at the OsNetwork below it.
  • The text form of the fault propagation path can be expressed as "OsRouter-OsNetwork".
  • The number of fault propagation paths that the analysis device extracts using the frequent subgraph mining algorithm can be 0, 1, or greater than 1.
  • No fault propagation path may be extracted from some event node connection graphs, one or more fault propagation paths may be extracted from others, and the same fault propagation path may be extracted from multiple event node connection graphs.
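The extraction idea can be illustrated with a deliberately simplified stand-in: counting how many event node connection graphs contain the same directed (cause, effect) edge and keeping the edges that reach a minimum support. This is not gSpan or CloseGraph, which mine larger subgraphs; the edge encoding, the event labels, and the `min_support` threshold are assumptions made for the sketch.

```python
from collections import Counter


def frequent_edges(graphs, min_support):
    """Return edges that appear in at least min_support graphs, with their support."""
    support = Counter()
    for graph in graphs:
        for edge in set(graph):  # count each edge at most once per graph
            support[edge] += 1
    return {edge: n for edge, n in support.items() if n >= min_support}


# Each edge is ((node type, event), (node type, event)); the data is hypothetical.
e1 = (("OsRouter", "event101"), ("OsNetwork", "event102"))
e2 = (("OsNetwork", "event102"), ("L3link", "event103"))
graphs = [[e1], [e1, e2], [e1]]

print(frequent_edges(graphs, min_support=2))
```

With these inputs only `e1` survives, which matches the observation that some graphs contribute no path and several graphs can yield the same path.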
  • the fault propagation path is represented as an abnormal event 101 on node 1 which will cause an abnormal event 102 on node 2 to occur. That is, node 1 represents the physical node that is the root cause of the failure event.
  • The fault propagation path shown in Figure 5 can be expressed as "OsNetwork-L3link-BGPpeer".
  • This fault propagation path indicates that a neighbor protocol status failure (an abnormal event) in the OSPF network segment (OsNetwork) makes the IP of the BGP Loopback port unreachable (L3link), which eventually leads to disconnection of the BGP peer (BGPpeer) (an abnormal event).
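A small sketch of reading the text form back into its parts, assuming that node type names never contain a hyphen themselves. The function name and the returned field names are hypothetical.

```python
def parse_path(text: str) -> dict:
    """Split a text-form fault propagation path into start, intermediate, and end nodes."""
    nodes = text.split("-")
    if len(nodes) < 2:
        raise ValueError("a fault propagation path needs a start and an end node")
    return {"start": nodes[0], "intermediate": nodes[1:-1], "end": nodes[-1]}


print(parse_path("OsNetwork-L3link-BGPpeer"))
print(parse_path("OsRouter-OsNetwork"))
```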
  • the probability and/or duration of the first fault propagation path can also be determined. That is, in addition to the first failure propagation path, the first result may also include the probability and/or duration of the first failure propagation path.
  • The analysis device can determine the probability of the extracted first fault propagation path (hereinafter referred to as the first probability), the fault propagation time corresponding to the extracted first fault propagation path (hereinafter referred to as the first duration), or both.
  • the first probability is the probability that the occurrence of the second abnormal event at the second node is caused by the first abnormal event at the first node.
  • According to the abnormal event information in the first network data, the analysis device determines the time interval between the occurrence of the first abnormal event at the first node (the starting point) of the first fault propagation path and the occurrence of the second abnormal event at the second node (the end point). This time interval is the first duration.
  • For example, the analysis device determines that the first duration is the time interval between the occurrence of the first abnormal event (abnormal event 101) at the first node (node 1) and the occurrence of the second abnormal event (abnormal event 102) at the second node (node 2).
  • If the first abnormal event at the starting point of the first fault propagation path occurs at 11:25 and the second abnormal event at the end point occurs at 11:26:34, the first duration is 1 minute and 34 seconds.
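The arithmetic of this example can be checked with the standard library, assuming that 11:25 means 11:25:00 and that both events fall on the same day.

```python
from datetime import datetime

FMT = "%H:%M:%S"
start = datetime.strptime("11:25:00", FMT)  # first abnormal event at the starting point
end = datetime.strptime("11:26:34", FMT)    # second abnormal event at the end point

first_duration = end - start
print(first_duration)                  # 0:01:34
print(first_duration.total_seconds())  # 94.0
```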
  • the analysis device may determine the first number of occurrences of the first failure propagation path and the third number of occurrences of the second abnormal event at the end of the first failure propagation path according to the event node connection graph.
  • the analysis device determines the first probability in the following way:
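The formula for the first probability is not reproduced in this text. Since the first probability is described as the probability that the second abnormal event at the second node is caused by the first abnormal event at the first node, one plausible reading, assumed here, is the first number divided by the third number. The function name is hypothetical.

```python
def first_probability(first_count: int, third_count: int) -> float:
    """ASSUMED ratio: occurrences of the path over occurrences of the end-node event."""
    if third_count == 0:
        return 0.0
    return first_count / third_count


# The path was seen 3 times while the end-node event occurred 4 times in the period.
print(first_probability(3, 4))
```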
  • 203. The analysis device obtains historical results, where the historical results include a second result.
  • The analysis device obtains the historical fault propagation paths. Any one of the historical fault propagation paths can be called a second fault propagation path. The second fault propagation path is a fault propagation path from before the first time period, and it indicates that a third abnormal event occurring at a third node causes a fourth abnormal event to occur at a fourth node.
  • The analysis device can also obtain the fault propagation duration of the second fault propagation path (hereinafter referred to as the second duration), the second number of occurrences of the second fault propagation path before the first time period, and the fourth number of occurrences of the fourth abnormal event at the fourth node of the second fault propagation path before the first time period. That is, the second result can be the second fault propagation path alone; the second fault propagation path and the second duration; or the second fault propagation path, the second duration, the second number, and the fourth number.
  • The second result or historical result in the embodiment of this application is obtained by processing historical network data.
  • The processing method can be a frequent subgraph mining algorithm, graph embedding, clustering, or another technique. It is understandable that the second result or historical result may be the result of incremental (overlay) updates, or the result of processing all the data; the details are not limited here.
  • 204. The analysis device determines whether the historical fault propagation paths include a target fault propagation path that is the same as the first fault propagation path. If so, step 205 is executed; if not, step 206 is executed.
  • the analysis device compares the second fault propagation path in the historical fault propagation path with the first fault propagation path one by one.
  • The analysis device can determine whether the third node and the fourth node in the second fault propagation path are the same as the first node and the second node in the first fault propagation path, respectively, and whether the third abnormal event occurring at the third node and the fourth abnormal event occurring at the fourth node are of the same event types as the first abnormal event occurring at the first node and the second abnormal event occurring at the second node, respectively. If both checks succeed, it is determined that the historical fault propagation paths include a target fault propagation path that is the same as the first fault propagation path.
  • For example, in the second fault propagation path, an abnormal event at the third node at half past five causes an abnormal event at the fourth node at six o'clock. In the first fault propagation path, abnormal event 101 at the first node at seven o'clock causes abnormal event 102 at the second node at eight o'clock. The first node and the third node are the same node, the second node and the fourth node are the same node, the abnormal event at the third node at half past five is of the same event type as abnormal event 101, and the abnormal event at the fourth node at six o'clock is of the same event type as abnormal event 102. It is therefore determined that the historical fault propagation paths include a target fault propagation path that is the same as the first fault propagation path.
  • In this example, the event type shared by abnormal event 101 and the abnormal event at the third node at half past five is "OSPF network segment neighbor protocol status down".
  • The event type shared by abnormal event 102 and the abnormal event at the fourth node at six o'clock is "BGP Loopback port IP unreachable".
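  • Further, the analysis device can also determine whether the nodes passed by the second fault propagation path are consistent with the nodes passed by the first fault propagation path; if they are, it is determined that the historical fault propagation paths include a target fault propagation path that is the same as the first fault propagation path.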
  • the intermediate nodes (nodes other than the start point and the end point) through which the fault propagation path passes may have abnormal events or no abnormal events, which is not limited here.
  • If abnormal events occur at the nodes passed by the second fault propagation path and the first fault propagation path, the analysis device can also determine whether the abnormal events occurring at the nodes passed by the second fault propagation path are of the same event types as those occurring at the nodes passed by the first fault propagation path; if so, it is determined that the historical fault propagation paths include a target fault propagation path that is the same as the first fault propagation path.
  • Further, the analysis device may also determine whether the order of the nodes passed by the second fault propagation path is consistent with the order of the nodes passed by the first fault propagation path; if it is, it is determined that the historical fault propagation paths include a target fault propagation path that is the same as the first fault propagation path.
  • the second fault propagation path that is the same as the first fault propagation path is called the target fault propagation path.
  • If the judgment in step 204 is yes, that is, when the analysis device determines that the historical fault propagation paths include a target fault propagation path that is the same as the first fault propagation path, a record exists from before the first time period, and the analysis device updates the second result.
  • For example, if the target fault propagation path in the historical fault propagation paths (i.e., "OsRouter-OsNetwork") occurred 150 times before the first time period, and the first fault propagation path (i.e., "OsRouter-OsNetwork") occurred 10 times in the first time period, the analysis device updates the count of "OsRouter-OsNetwork" to 150 + 10 = 160.
  • Optionally, when the historical fault propagation paths include a target fault propagation path that is the same as the first fault propagation path, the analysis device may calculate the target duration and update the second duration to the target duration.
  • Taking "OsRouter-OsNetwork" as an example, "OsRouter-OsNetwork" has a second duration in the second result and a first duration in the first result.
  • For example, the first result includes the first fault propagation path, the first duration, the first number of times, and the third number of times, as shown in Table 2.
  • For example, the second result includes the target fault propagation path, the second duration, the second number of times, and the fourth number of times, as shown in Table 3.
  • In one approach, the analysis device compares the second duration with the first duration and takes the larger of the two as the target duration. In the example of Table 2 and Table 3, the analysis device determines that 1 minute is the target duration.
  • When the second duration is greater than the first duration, the stored duration need not be updated: the target duration is then the second duration, which is already on record.
  • In another approach, the analysis device can calculate the target duration by a formula.
  • the target duration can also be calculated by other methods, which are not specifically limited here.
  • Optionally, when the historical fault propagation paths include a target fault propagation path that is the same as the first fault propagation path, the third abnormal event occurring at the first node in the target fault propagation path is of the same event type as the first abnormal event occurring at the first node in the first fault propagation path, and the fourth abnormal event occurring at the second node in the target fault propagation path is of the same event type as the second abnormal event occurring at the second node in the first fault propagation path.
  • the analysis device can also calculate the target probability and update the probability of the target failure propagation path to the target probability.
  • the analysis device can calculate the target probability in the following way.
  • the target probability may also be calculated by other methods, which are not specifically limited here.
  • If the judgment in step 204 is no, that is, when the analysis device determines that the historical fault propagation paths do not include a target fault propagation path that is the same as the first fault propagation path, no record exists from before the first time period, and the analysis device saves the first result.
  • For example, FIG. 5 shows the first fault propagation path, "OsNetwork-L3link-BGPpeer". Since the first fault propagation path is not recorded among the second fault propagation paths from before the first time period, the analysis device saves the first fault propagation path, that is, adds it to the record.
  • Optionally, when the historical fault propagation paths do not include a target fault propagation path that is the same as the first fault propagation path, the fault propagation time (the first duration) corresponding to the first fault propagation path has no record from before the first time period, and the analysis device can save the first duration.
  • the first result includes a first fault propagation path, a first duration, and a first number of times and a third number of times. As shown in Table 4:
  • That is, the analysis device saves the fault propagation time of the fault propagation path "OsNetwork-L3link-BGPpeer" as 1 min.
  • Optionally, when the historical fault propagation paths do not include a second fault propagation path that is the same as the first fault propagation path, the probability (the first probability) corresponding to the first fault propagation path has no record from before the first time period, and the analysis device can calculate or save the first probability.
  • If the first probability was calculated in step 202, the analysis device can save it directly. If it was not calculated in step 202, the analysis device can calculate it in the same manner as in step 202, which is not repeated here.
  • Step 203 in the embodiment of the present application may be before step 202 or before step 201, as long as it is before step 204, and step 206 may be before step 205, as long as it is after step 204.
  • If step 203 is after step 202, then in step 203 the analysis device can also obtain only the historical result corresponding to the starting point of the first fault propagation path, which both reduces unnecessary data transmission and improves comparison efficiency in the subsequent judgment.
  • The fault propagation path in the embodiment of the present application can be applied to fault location. Taking the fault propagation path "OsNetwork-L3link-BGPpeer" shown in FIG. 5 as an example: assuming that an abnormal event occurs at a BGPpeer node, the L3link node connected to the BGPpeer node is found according to FIG. 5, it is then queried whether the L3link node is connected to an OsNetwork entity node, and it is checked whether the OsNetwork entity node has an alarm. If there is an alarm, the OsNetwork entity node is located as the root cause of the fault. That is, the disconnection of the BGP neighbor link is found to be caused by the neighbor protocol state being down in the OSPF network segment.
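The lookup described above can be sketched as a backward walk along the path. This is only an illustration under stated assumptions, not the patent's implementation: the function name `locate_root_cause`, the `neighbors` and `node_type` dictionaries, and the node identifiers are all hypothetical.

```python
def locate_root_cause(path_types, neighbors, node_type, alarmed, symptom_node):
    """Walk a fault propagation path backwards from the symptom node.

    path_types:   node types from root cause to symptom,
                  e.g. ["OsNetwork", "L3link", "BGPpeer"]
    neighbors:    node id -> set of directly connected node ids (hypothetical topology)
    node_type:    node id -> node type
    alarmed:      set of node ids that currently have an alarm
    symptom_node: id of the node where the abnormal event was observed
    Returns the id of an alarmed root-cause candidate, or None.
    """
    if node_type.get(symptom_node) != path_types[-1]:
        return None
    candidates = {symptom_node}
    # Step back one hop at a time, keeping only neighbors of the expected type.
    for wanted in reversed(path_types[:-1]):
        candidates = {
            n
            for c in candidates
            for n in neighbors.get(c, set())
            if node_type.get(n) == wanted
        }
        if not candidates:
            return None
    # Only a candidate that actually has an alarm is reported as the root cause.
    roots = sorted(n for n in candidates if n in alarmed)
    return roots[0] if roots else None


# Hypothetical three-node topology matching "OsNetwork-L3link-BGPpeer".
neighbors = {"bgp-1": {"l3-1"}, "l3-1": {"bgp-1", "net-1"}, "net-1": {"l3-1"}}
node_type = {"bgp-1": "BGPpeer", "l3-1": "L3link", "net-1": "OsNetwork"}
path = ["OsNetwork", "L3link", "BGPpeer"]
root = locate_root_cause(path, neighbors, node_type, {"net-1"}, "bgp-1")
print(root)
```

If the OsNetwork node has no alarm, the function returns `None`, which corresponds to the case where this path does not explain the observed fault.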
  • The analysis device obtains the first result related to the first fault propagation path by processing the first network data, and obtains the historical result related to the historical fault propagation paths. If the historical fault propagation paths include a target fault propagation path that is the same as the first fault propagation path, the analysis device updates the number of times of the target fault propagation path; if not, it saves the first fault propagation path. This avoids having to process all acquired historical network data together with the new network data every time a fault is located in order to obtain a new fault propagation path.
  • The embodiment of the present application may process only the latest network data and incrementally update the historical fault propagation paths.
  • The storage cost of keeping all historical network data is avoided, because only the historical results related to the historical fault propagation paths are saved.
  • The efficiency of fault location is improved, because processing only the new network data is more efficient than processing all historical network data together with the new network data.
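The incremental flow of steps 204 to 206 can be sketched as an update-or-insert on a stored record. This is a minimal sketch under assumptions: `PathRecord`, `incremental_update`, and keying paths by their text form are hypothetical, the durations and event counts are illustrative values, and the duration is merged by taking the maximum, which is only one of the options the text names.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class PathRecord:
    count: int            # times the fault propagation path occurred
    end_event_count: int  # times the end node's event type occurred
    duration: float       # fault propagation duration in seconds


def incremental_update(history: Dict[str, PathRecord], path: str, new: PathRecord) -> str:
    """Merge the result for one time period into the stored historical results."""
    old = history.get(path)
    if old is None:
        # Step 206: no record before this time period, so save the first result.
        history[path] = new
        return "saved"
    # Step 205: a record exists, so update the second result in place.
    old.count += new.count
    old.end_event_count += new.end_event_count
    old.duration = max(old.duration, new.duration)
    return "updated"


history = {"OsRouter-OsNetwork": PathRecord(150, 375, 60.0)}
status_known = incremental_update(history, "OsRouter-OsNetwork", PathRecord(10, 25, 40.0))
status_new = incremental_update(history, "OsNetwork-L3link-BGPpeer", PathRecord(150, 400, 60.0))
print(status_known, history["OsRouter-OsNetwork"].count)
print(status_new, len(history))
```

Only the compact records are kept between runs; the raw network data of earlier periods is never read again, which is the point of the incremental design.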
  • this embodiment provides the interaction flow between the analysis device and the collection device and the cloud device respectively. Please refer to FIG. 6.
  • Another embodiment of the data processing method in the embodiment of the present application includes:
  • the collection device collects first network data.
  • the collection device collects information about abnormal events related to the failure and the connection relationship of each node in the network.
  • the information of the abnormal event may include at least one of fault information, alarms, KPIs, logs, and the like.
  • The connection relationship of the nodes in the network may be created by the collection device or by other network devices.
  • The creation process can be as follows: first, information about the relevant network entity objects is extracted from the abnormal-event information according to the structural framework (schema) defined by experts. After parsing, network objects represented as structured JSON data are obtained. The JSON objects are entered into a graph database, where they are represented as nodes, and the connection relationships between the objects are established according to the attribute relationships between them.
  • This is only one of many ways to create the connection relationship. In practical applications, other structured data formats can also be used to create the connection relationship.
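The creation process above can be sketched with plain dictionaries standing in for the graph database. This is a sketch under assumptions: the JSON fields `id`, `type`, and `refs` are a hypothetical schema invented for illustration, and `build_connections` is not a function from the source.

```python
import json


def build_connections(raw_json):
    """Turn parsed JSON network objects into nodes and directed connections."""
    objects = json.loads(raw_json)
    # Each JSON object becomes one node, keyed by its identifier.
    nodes = {obj["id"]: obj["type"] for obj in objects}
    edges = set()
    for obj in objects:
        # An attribute that references another object becomes a connection.
        for ref in obj.get("refs", []):
            if ref in nodes:  # ignore references to objects that were not extracted
                edges.add((obj["id"], ref))
    return nodes, edges


raw = json.dumps([
    {"id": "router-1", "type": "OsRouter", "refs": ["net-1"]},
    {"id": "net-1", "type": "OsNetwork", "refs": ["l3-1", "missing"]},
    {"id": "l3-1", "type": "L3link", "refs": []},
])
nodes, edges = build_connections(raw)
print(sorted(edges))
```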
  • the collection device sends the first network data to the analysis device.
  • After the collection device collects the first network data, it sends the first network data to the analysis device.
  • the analysis device processes the first network data to obtain a first result.
  • the analysis device obtains the historical result, and the historical result includes the second result.
  • Steps 603 and 604 in this embodiment are similar to steps 202 and 203 in the embodiment shown in FIG. 2, and are not repeated here.
  • the analysis device updates the second result.
  • Step 605 in this embodiment is similar to step 205 in the embodiment shown in FIG. 2, and is not repeated here.
  • the analysis device sends the updated second result to the cloud device.
  • After the analysis device updates the second result, it can send the updated second result to the cloud device.
  • After receiving the updated second result, the cloud device can present and summarize information such as the updated fault propagation path.
  • The cloud device may be integrated in an operation support system (OSS).
  • the analysis device saves the first result.
  • Step 607 in this embodiment is similar to step 206 in the embodiment shown in FIG. 2, and is not repeated here.
  • the analysis device sends the first result to the cloud device.
  • After the analysis device saves the first result, it can send the first result to the cloud device. After receiving the first result, the cloud device may present and summarize information such as the new fault propagation path.
  • step 606 may be after step 607 or after step 608, as long as it is after step 605.
  • step 607 may be after step 608 or before step 606, which is not specifically limited here.
  • the information interaction between the collection device and the analysis device, and the analysis device and the cloud device is realized, and the transmission of network data (all historical network data) is reduced.
  • The analysis device of the embodiment of the present application may process only the latest network data, incrementally update the historical fault propagation paths, and transmit them to the cloud platform, which can display them.
  • the storage cost of all historical network data is reduced, and only historical results related to the propagation path of historical faults are saved.
  • the efficiency of fault location is improved, that is, the efficiency of only processing new network data is higher than the efficiency of processing all historical network data and new network data.
  • An embodiment of the analysis device in the embodiment of the application includes:
  • the obtaining unit 701 is configured to obtain first network data, where the first network data includes information about abnormal events of multiple nodes in the network during a first time period and the connection relationship of the multiple nodes.
  • the processing unit 702 is configured to process the first network data to obtain a first fault propagation path, where the first fault propagation path indicates that a first abnormal event that occurs at the first node in the first time period causes the second node A second abnormal event occurs, and the first node and the second node are any two different nodes among the multiple nodes.
  • the acquiring unit 701 is also used to acquire historical fault propagation paths.
  • the determining unit 703 is configured to determine whether the historical fault propagation paths include a target fault propagation path that is the same as the first fault propagation path, where the target fault propagation path indicates that a third abnormal event occurring at the first node before the first time period causes a fourth abnormal event to occur at the second node, the third abnormal event and the first abnormal event are of the same event type, and the fourth abnormal event and the second abnormal event are of the same event type.
  • the updating unit 704 is configured to update the number of times of the target fault propagation path when the historical fault propagation paths include the target fault propagation path.
  • the acquiring unit 701 acquires the first network data and the historical fault propagation paths, and the processing unit 702 processes the first network data to obtain the first fault propagation path.
  • When the historical fault propagation paths include the target fault propagation path, the updating unit 704 updates the number of times of the target fault propagation path, which is more efficient than processing all network data; and compared with storing all historical network data, only the historical fault propagation paths are stored, which reduces storage costs and provides replicability and scalability.
  • another embodiment of the analysis device in the embodiment of the present application includes:
  • the acquiring unit 801 is configured to acquire first network data, where the first network data includes information about abnormal events of multiple nodes in the network during a first time period and the connection relationship of the multiple nodes.
  • the processing unit 802 is configured to process the first network data to obtain a first fault propagation path, where the first fault propagation path indicates that a first abnormal event that occurs at the first node in the first time period causes the second node A second abnormal event occurs, and the first node and the second node are any two different nodes among the multiple nodes.
  • the acquiring unit 801 is also used to acquire historical fault propagation paths.
  • the determining unit 803 is configured to determine whether the historical fault propagation paths include a target fault propagation path that is the same as the first fault propagation path, where the target fault propagation path indicates that a third abnormal event occurring at the first node before the first time period causes a fourth abnormal event to occur at the second node, the third abnormal event and the first abnormal event are of the same event type, and the fourth abnormal event and the second abnormal event are of the same event type.
  • the updating unit 804 is configured to update the number of times of the target fault propagation path when the historical fault propagation paths include the target fault propagation path.
  • the saving unit 805 is configured to save the first fault propagation path when the historical fault propagation paths do not include the target fault propagation path.
  • the first calculation unit 806 is configured to calculate the target duration from the first result and the second result.
  • the second calculation unit 807 is configured to calculate the target probability from the first result and the second result.
  • the first calculation unit 806 and the second calculation unit 807 in the embodiment of the present application may also be the same calculation unit, which is not specifically limited here.
  • the processing unit 802 obtains the first result related to the first fault propagation path by processing the first network data, and the acquiring unit 801 then obtains the historical result related to the historical fault propagation paths. If the historical fault propagation paths include a target fault propagation path that is the same as the first fault propagation path, the updating unit 804 updates the number of times of the target fault propagation path; if not, the saving unit 805 saves the first fault propagation path.
  • Referring to FIG. 9, another embodiment of the analysis device in the embodiment of the present application includes:
  • the analysis device 900 may include one or more processors 901 and a memory 905, and the memory 905 stores one or more application programs or data.
  • the memory 905 may be volatile storage or persistent storage.
  • the program stored in the memory 905 may include one or more modules, and each module may include a series of instruction operations on the analysis device.
  • the processor 901 may be configured to communicate with the memory 905, and execute a series of instruction operations in the memory 905 on the analysis device 900.
  • the analysis device 900 may also include one or more power supplies 902, one or more wired or wireless network interfaces 903, one or more input and output interfaces 904, and/or one or more operating systems, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
  • the processor 901 can execute the operations performed by the analysis device in the embodiments shown in FIG. 2 and FIG. 6, and details are not described herein again.
  • the disclosed system, device, and method can be implemented in other ways.
  • The device embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • Based on this understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the methods described in the various embodiments of the present application.
  • The aforementioned storage media include media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

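The embodiments of the present application disclose a data processing method that can be applied to a data center network. An analysis device acquires first network data and historical fault propagation paths, and processes the first network data to obtain a first fault propagation path. When the historical fault propagation paths include a target fault propagation path that is the same as the first fault propagation path, the analysis device updates the number of times of the target fault propagation path. This is more efficient than processing all network data; and compared with storing all historical network data, only the historical fault propagation paths are stored, which reduces storage costs and provides replicability and scalability.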

Description

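Data processing method and related device
This application claims priority to Chinese patent application No. 202010093223.2, filed with the China Patent Office on February 14, 2020 and entitled "Data processing method and related device", which is incorporated herein by reference in its entirety.
Technical field
The embodiments of the present application relate to the field of communication technologies, and in particular to a data processing method and a related device.
Background
A network fault is a state in which the network cannot provide normal service, or provides degraded service, because of hardware problems, software vulnerabilities, virus intrusion, and the like. For example, in a data center network, faults such as address resolution protocol (ARP) entries exceeding their limit, device restarts, and router identity (ID) conflicts all reduce overall network health and affect services.
In the prior art, when a fault propagation path needs to be determined, newly generated network data and all historical network data are processed to obtain the fault propagation path, which represents the path along which a fault propagates in the network.
However, as time passes, historical data accumulates and storage costs rise. Moreover, every time a fault propagation path needs to be determined, all network data (historical network data and newly generated network data) must be reprocessed, so computation is inefficient.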
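Summary
The embodiments of the present application provide a data processing method and a related device that are more efficient than processing all network data; and compared with storing all historical network data, only the historical fault propagation paths are stored, which reduces storage costs and provides replicability and scalability.
A first aspect of the embodiments of the present application provides a data processing method, including: an analysis device acquires first network data, where the first network data includes information about abnormal events of multiple nodes in a network in a first time period and the connection relationship of the multiple nodes; the analysis device processes the first network data to obtain a first fault propagation path, where the first fault propagation path indicates that a first abnormal event occurring at a first node in the first time period causes a second abnormal event to occur at a second node, and the first node and the second node are any two different nodes among the multiple nodes; the analysis device acquires historical fault propagation paths; the analysis device determines whether the historical fault propagation paths include a target fault propagation path that is the same as the first fault propagation path, where the target fault propagation path indicates that a third abnormal event occurring at the first node before the first time period causes a fourth abnormal event to occur at the second node, the third abnormal event and the first abnormal event are of the same event type, and the fourth abnormal event and the second abnormal event are of the same event type; and when the historical fault propagation paths include the target fault propagation path, the analysis device updates the number of times of the target fault propagation path.
In the embodiments of the present application, the analysis device acquires the first network data and the historical fault propagation paths, and processes the first network data to obtain the first fault propagation path. When it determines that the historical fault propagation paths include a second fault propagation path that is the same as the first fault propagation path, it updates the number of times of that second fault propagation path. This is more efficient than processing all network data; and compared with storing all historical network data, only the historical fault propagation paths are stored, which reduces storage costs and provides replicability and scalability.
Based on the first aspect, in a first implementation of the first aspect, the nodes passed by the target fault propagation path are the same as the nodes passed by the first fault propagation path.
In the embodiments of the present application, the target fault propagation path is determined to be the same as the first fault propagation path when the nodes they pass are the same, which makes subsequent fault location and troubleshooting more detailed and accurate.
Based on the first aspect or its first implementation, in a second implementation of the first aspect, the analysis device may process the first network data to obtain a first result. The first result includes the first fault propagation path and a first duration, where the first duration is obtained by processing a first time interval between the occurrence time of the first abnormal event at the first node and the occurrence time of the second abnormal event at the second node of the first fault propagation path in the first time period. The analysis device may acquire a historical result that includes a second result. The second result includes the target fault propagation path and a second duration corresponding to the target fault propagation path, where the second duration is obtained by processing a second time interval between the occurrence time of the third abnormal event and the occurrence time of the fourth abnormal event before the first time period. The analysis device calculates a target duration from the first result and the second result, and updates the second duration to the target duration.
In the embodiments of the present application, the duration of the fault propagation path is updated incrementally, which provides a reference for subsequently predicting how long the impact of a fault lasts.
Based on the first aspect or its first or second implementation, in a third implementation of the first aspect, the analysis device takes the larger of the first duration and the second duration as the target duration.
In the embodiments of the present application, the target duration is specified to be the maximum duration, which improves the realizability of the solution.
Based on the first aspect or any of its first to third implementations, in a fourth implementation of the first aspect, the first result may further include a first number of times, which is the number of times the first fault propagation path occurred in the first time period; the second result may further include a second number of times, which is the number of times the target fault propagation path occurred before the first time period. The analysis device calculates the target duration as follows: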
Figure PCTCN2020108424-appb-000001
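In the embodiments of the present application, one way of calculating the target duration is specified, which improves the realizability of the solution.
Based on the first aspect or any of its first to fourth implementations, in a fifth implementation of the first aspect, the first result further includes a third number of times, which is the number of times an abnormal event of the same event type as the second abnormal event occurred at the second node in the first time period; the second result further includes a fourth number of times, which is the number of times an abnormal event of the same event type as the fourth abnormal event occurred at the second node before the first time period. The analysis device calculates a target probability from the first result and the second result, and updates the probability of the target fault propagation path to the target probability.
In the embodiments of the present application, the probability of the target fault propagation path is updated to the target probability incrementally, which helps improve the accuracy of subsequent root-cause determination.
Based on the first aspect or any of its first to fifth implementations, in a sixth implementation of the first aspect, the analysis device processes the first network data to obtain a first result, which includes the first fault propagation path and a third number of times; the third number of times is the number of times an abnormal event of the same event type as the second abnormal event occurred at the second node in the first time period. The analysis device acquires a historical result that includes a second result; the second result includes the target fault propagation path and a fourth number of times, which is the number of times, before the first time period, that an abnormal event the same as the abnormal event of the second node occurred at the second node. The analysis device calculates a target probability from the first result and the second result, and updates the probability of the target fault propagation path to the target probability.
In the embodiments of the present application, the probability of the target fault propagation path is updated to the target probability incrementally, which helps improve the accuracy of subsequent root-cause determination.
Based on the first aspect or any of its first to sixth implementations, in a seventh implementation of the first aspect, the analysis device calculates the target probability as follows: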
Figure PCTCN2020108424-appb-000002
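In the embodiments of the present application, one way of calculating the target probability is specified, which improves the realizability of the solution.
Based on the first aspect or any of its first to seventh implementations, in an eighth implementation of the first aspect, when the historical fault propagation paths do not include the target fault propagation path, the analysis device saves the first fault propagation path.
In the embodiments of the present application, when there is no historical record, saving the first fault propagation path provides a new reference for subsequent troubleshooting.
Based on the first aspect or any of its first to eighth implementations, in a ninth implementation of the first aspect, the analysis device processes the first network data to obtain a first result, which includes the first fault propagation path and a first duration; the first duration is obtained by processing a first time interval between the occurrence time of the alarm corresponding to the first fault at the first node and the occurrence time of the alarm corresponding to the second fault at the second node of the first fault propagation path in the first time period. The analysis device saves the first duration.
In the embodiments of the present application, when there is no historical record, saving the first duration provides a new reference for subsequent troubleshooting.
Based on the first aspect or any of its first to ninth implementations, in a tenth implementation of the first aspect, the first result includes a first number of times and a third number of times; the first number of times is the number of times the first fault propagation path occurred in the first time period, and the third number of times is the number of times an abnormal event of the same event type as the second abnormal event occurred at the second node in the first time period. The analysis device saves a first probability of the first fault propagation path, where the first probability is the probability that the second abnormal event at the second node is caused by the first abnormal event at the first node.
In the embodiments of the present application, when there is no historical record, saving the first probability provides a new reference for subsequent troubleshooting.
Based on the first aspect or any of its first to tenth implementations, in an eleventh implementation of the first aspect, the analysis device processes the first network data based on a frequent subgraph mining algorithm to obtain the first fault propagation path.
In the embodiments of the present application, one way of processing the first network data is specified, which improves the realizability of the solution.
Based on the first aspect or any of its first to eleventh implementations, in a twelfth implementation of the first aspect, when the first fault propagation path has multiple first time intervals, the first duration is the maximum or the average of the multiple first time intervals.
In the embodiments of the present application, one way of handling multiple first time intervals is specified, which improves the realizability of the solution.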
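The twelfth implementation reduces several observed time intervals of one path to a single first duration. A minimal sketch, assuming intervals are expressed in seconds; the function name `first_duration` and the `mode` parameter are hypothetical, and the sample intervals are illustrative.

```python
from statistics import mean


def first_duration(intervals, mode="max"):
    """Reduce the first time intervals of one path to a single first duration."""
    if not intervals:
        raise ValueError("at least one time interval is required")
    if mode == "max":
        return max(intervals)
    if mode == "mean":
        return mean(intervals)
    raise ValueError("mode must be 'max' or 'mean'")


observed = [94, 60, 80]  # illustrative intervals in seconds
longest = first_duration(observed, mode="max")
average = first_duration(observed, mode="mean")
print(longest, average)
```

The maximum gives a conservative bound on how long a fault may take to propagate, while the average describes typical behavior; the source allows either.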
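Based on the first aspect or any of its first to twelfth implementations, in a thirteenth implementation of the first aspect, the analysis device sends the target result or the target fault propagation path to a cloud device.
A second aspect of the embodiments of the present application provides a data processing method, including: a collection device sends first network data to an analysis device, so that the analysis device processes the first network data to obtain a first fault propagation path, where the first network data includes abnormal information and a connection relationship.
A third aspect of the embodiments of the present application provides a data processing method, including: a cloud device receives a target result sent by an analysis device, where the target result includes at least one of a target fault propagation path, a target duration, and a target probability.
A fourth aspect of the embodiments of the present application provides an analysis device that performs the method of the first aspect.
A fifth aspect of the embodiments of the present application provides a collection device that performs the method of the second aspect.
A sixth aspect of the embodiments of the present application provides a cloud device that performs the method of the third aspect.
A seventh aspect of the embodiments of the present application provides a computer storage medium storing instructions that, when executed on a computer, cause the computer to perform the method of the first aspect.
An eighth aspect of the embodiments of the present application provides a computer program product that, when executed on a computer, causes the computer to perform the method of the first aspect.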
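Brief description of the drawings
FIG. 1 is a schematic diagram of the network framework in an embodiment of the present application;
FIG. 2 is a schematic flowchart of a data processing method in an embodiment of the present application;
FIG. 3 is a schematic diagram of an event-node connection graph in an embodiment of the present application;
FIG. 4 is a schematic diagram of a fault propagation path in an embodiment of the present application;
FIG. 5 is a schematic diagram of another fault propagation path in an embodiment of the present application;
FIG. 6 is another schematic flowchart of the data processing method in an embodiment of the present application;
FIG. 7 is a schematic structural diagram of an analysis device in an embodiment of the present application;
FIG. 8 is another schematic structural diagram of the analysis device in an embodiment of the present application;
FIG. 9 is another schematic structural diagram of the analysis device in an embodiment of the present application.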
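Detailed description
The embodiments of the present application provide a data processing method and a related device that are more efficient than processing all network data; and compared with storing all historical network data, only the historical fault propagation paths are stored, which reduces storage costs and provides replicability and scalability.
The implementation principles, specific implementations, and corresponding beneficial effects of the technical solutions of the present application are described in detail below with reference to the drawings.
The method provided in the embodiments of the present application can be applied to various communication networks, such as a data center network (DCN) or a mobile communication network. Devices in these communication networks can be connected to an analysis device, which updates or adds fault propagation paths that can be used to locate faults occurring in these networks. That is, the analysis device used to update or add fault propagation paths may be a device independent of the communication network. Of course, it may also be a device in the communication network; that is, fault propagation paths that can be used to locate faults in the communication network may also be updated or added by a device within that network.
FIG. 1 is a schematic diagram of the network architecture in an embodiment of the present application. Referring to FIG. 1, the network architecture includes a collection device 101, an analysis device 102, and a cloud device 103.
The embodiments of the present application are described using only three collection devices 101, two analysis devices 102, and one cloud device 103 as an example. In practical applications, there may be more or fewer collection devices 101 and analysis devices 102, or more cloud devices 103.
A communication connection is established between one collection device 101 and one analysis device 102. Optionally, to improve the reliability of communication between the collection device 101 and the analysis device 102, one collection device 101 may also establish communication connections with two or more analysis devices 102, and one analysis device 102 may also establish communication connections with two or more collection devices 101.
The collection device 101, the analysis device 102, and the cloud device 103 may be connected through a wired network or a wireless network. A wired connection is typically an optical fiber network; a wireless connection is typically a wireless fidelity (WiFi) network, a cellular wireless network, or another type of wireless network.
The main function of the collection device 101 is to collect network data such as fault data and abnormal data in the communication network. Optionally, it provides the network data to the analysis device 102.
The main function of the analysis device 102 is to extract, update, and add fault propagation path information. Optionally, it provides the fault propagation path information to the cloud device 103.
The cloud device 103 may be integrated in an operation support system (OSS) and presents the summarized and updated fault propagation path results.
The analysis device 102 may be a server, a server cluster composed of several servers, or a cloud computing service center. The cloud device 103 may be a computer, a server, a server cluster composed of several servers, or a cloud computing service center, deployed at the back end of the service network.
In the embodiments of the present application, if the collection device 101 integrates the function of updating or adding fault propagation paths, the collection device 101 may connect directly to the cloud device 103 without relying on the analysis device 102 to update or add fault propagation paths.
The data processing method in the embodiments of the present application is described below with reference to the network framework of FIG. 1: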
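Referring to FIG. 2, one embodiment of the data processing method in the embodiments of the present application includes:
201. The analysis device acquires first network data.
In the embodiments of the present application, the analysis device may acquire the first network data through a network device, or through manual input by operation and maintenance personnel; the manner of acquisition is not limited here. The network device may be a network device with a collection function, such as a router or a switch.
The first network data is the abnormal information of each node in the first time period together with the relationships among the nodes in the communication network. The abnormal information may be at least one of fault information, alarms, logs, key performance indicators (KPIs), or other events.
Different kinds of faults often occur in a communication network, and different faults may have different causes. For example, some faults are caused by the hardware of a physical device, and others by the protocols deployed on it. Therefore, when an abnormal event related to a fault occurs in the communication network, the nodes related to that abnormal event may be physical nodes such as physical devices, boards, and physical ports; logical nodes related to, for example, the open shortest path first (OSPF) protocol, the border gateway protocol (BGP), the rapid ring protection protocol (RRPP), and virtual local area networks (VLANs); or virtual nodes such as L3link, alarms, and logs.
There may be many event types in the embodiments of the present application. Common event types are exemplified in Table 1:
Table 1
Event type | One representation of the event type
Interface state changes | IF_STATE
Interface is deleted | IF_DELETE
Neighbor state changes | NBR_CHANGE_E
Port on an RRPP ring enters the forwarding state | PFWD
Interface configurations for establishing OSPF neighbors are inconsistent | ospfIfConfigError
OSPF neighbor state changes | ospfNbrStateChange_active
Interface state changes | linkDown_active
VXLAN tunnel state changes to Down | hwNvo3VxlanTnlDown
It can be understood that Table 1 only gives examples of event types and representations; in practical applications there are other event types and other representations, which are not limited here.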
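202. The analysis device processes the first network data to obtain a first result.
The way in which the analysis device processes the first network data to obtain the first result is illustrated here using only a frequent subgraph mining algorithm as an example. It can be understood that many other approaches are possible in practical applications, such as graph embedding and clustering techniques, which are not limited here.
The frequent subgraph mining algorithm in the embodiments of the present application may be gSpan, CloseGraph, or a similar algorithm, which is not limited here.
After acquiring the first network data, the analysis device extracts from the abnormal information the abnormal events related to faults and the nodes related to those abnormal events. It then generates an event-node connection graph according to the relationships between the extracted abnormal events and their related nodes. FIG. 3 shows the connection relationships of the nodes and the abnormal events occurring at each node; FIG. 3 is only one example of an event-node connection graph.
Optionally, the event-node connection graph may be represented graphically or in other forms, for example as table entries, which is not limited here.
The analysis device uses the frequent subgraph mining algorithm to extract common propagation paths from the event-node connection graphs of all faults. A common propagation path is a fault propagation path, which indicates that an abnormal event at one node causes an abnormal event at another node. For example, fault propagation paths are extracted from multiple event-node connection graphs similar to FIG. 3; one form of fault propagation path is shown in FIG. 4, where abnormal event 101 at node 1 causes abnormal event 102 at node 2. The path connecting node 1 and node 2 contains no node at which a fault alarm occurs, that is, no fault-related event exists on the path connecting the nodes, and the two nodes are directly connected, which means the hop count between them is 1. It can be understood that FIG. 4 is only an example; the hop count between two nodes may also be an integer greater than 1, which is not limited here.
The fault propagation path in the embodiments of the present application may take a visual graphical form, a text form, or another form, which is not limited here.
For example, node 1 in the fault propagation path is an OSPF router (OsRouter), and node 2 is an OSPF network segment (OsNetwork) among the network nodes. That is, abnormal event 101 at OsRouter causes abnormal event 102 at the OsNetwork beneath it. In text form, the fault propagation path can be written as "OsRouter-OsNetwork".
The number of fault propagation paths that the analysis device extracts using the frequent subgraph mining algorithm may be 0, 1, or more than 1. Moreover, no fault propagation path may be extractable from some event-node connection graphs, one or more may be extractable from others, and the same fault propagation path may be extracted from multiple event-node connection graphs.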
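The idea of extracting propagation paths shared by many event-node connection graphs can be illustrated with a deliberately simplified stand-in. The sketch below is not gSpan or CloseGraph: it only counts single-edge patterns (one cause and one effect) and keeps those whose support reaches a threshold. The names `frequent_edges` and `min_support`, and the representation of an edge as a pair of `(node type, event type)` tuples, are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

# An edge is (cause, effect); each side is (node type, event type).
Edge = Tuple[Tuple[str, str], Tuple[str, str]]


def frequent_edges(graphs: List[Set[Edge]], min_support: int) -> Dict[Edge, int]:
    """Return the edges that appear in at least min_support graphs."""
    support = Counter()
    for graph in graphs:
        # A graph is a set, so each graph contributes at most once per edge.
        support.update(graph)
    return {edge: n for edge, n in support.items() if n >= min_support}


router_to_network = (("OsRouter", "event101"), ("OsNetwork", "event102"))
network_to_link = (("OsNetwork", "event102"), ("L3link", "event103"))
board_to_port = (("Board", "event201"), ("Port", "event202"))

graphs = [
    {router_to_network},
    {router_to_network, network_to_link},
    {board_to_port},
]
paths = frequent_edges(graphs, min_support=2)
print(paths)
```

A full frequent subgraph miner would also grow these one-edge patterns into longer paths such as "OsNetwork-L3link-BGPpeer"; that extension is omitted here.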
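As shown in FIG. 4, the fault propagation path indicates that abnormal event 101 at node 1 causes abnormal event 102 at node 2. That is, node 1 is the root-cause entity node of the fault event.
For example, the fault propagation path shown in FIG. 5 can be written as "OsNetwork-L3link-BGPpeer". It indicates that a neighbor protocol state fault (an abnormal event) in the OSPF network segment (OsNetwork) makes the IP of the BGP Loopback port unreachable (L3link), which finally causes the BGP neighbor (BGPpeer) link to be disconnected (an abnormal event).
Further, after determining the first fault propagation path, the analysis device may also determine the probability and/or duration with which the first fault propagation path occurs. That is, in addition to the first fault propagation path, the first result may include the probability and/or duration of the first fault propagation path. The analysis device may determine the probability of occurrence of the extracted first fault propagation path (hereinafter, the first probability), the fault propagation duration corresponding to it (hereinafter, the first duration), or both. The first probability is the probability that the second abnormal event at the second node is caused by the first abnormal event at the first node.
Optionally, based on the abnormal-event information in the acquired first network data, the analysis device determines the first duration as the time interval between the occurrence time of the first abnormal event at the first node (the starting point) of the first fault propagation path and the occurrence time of the second abnormal event at the second node (the end point). Taking the fault propagation path shown in FIG. 4 as the first fault propagation path, the analysis device determines the first duration as the time interval between the occurrence of the first abnormal event (abnormal event 101) at the first node (node 1) and the occurrence of the second abnormal event (abnormal event 102) at the second node (node 2). For example, if the first abnormal event at the starting point occurs at 11:25 and the second abnormal event at the end point occurs at 11:26:34, the first duration is 1 minute 34 seconds.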
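The first duration in this example is a plain timestamp difference. A minimal sketch, assuming both events occur on the same day and that timestamps are given as `HH:MM:SS` strings; the function name `propagation_duration` is hypothetical.

```python
from datetime import datetime


def propagation_duration(start_event_time: str, end_event_time: str) -> float:
    """Seconds between the abnormal event at the start node and at the end node."""
    fmt = "%H:%M:%S"
    start = datetime.strptime(start_event_time, fmt)
    end = datetime.strptime(end_event_time, fmt)
    return (end - start).total_seconds()


first_duration_seconds = propagation_duration("11:25:00", "11:26:34")
print(first_duration_seconds)  # 94.0 seconds, i.e. 1 minute 34 seconds
```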
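Optionally, the analysis device may determine, from the event-node connection graphs, the first number of times that the first fault propagation path occurs and the third number of times that the second abnormal event occurs at the end point of the first fault propagation path. The analysis device determines the first probability as follows: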
Figure PCTCN2020108424-appb-000003
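The formula itself is only available as an image reference here, so its exact form cannot be confirmed from this text. One reading consistent with the surrounding definitions is that the first probability is the ratio of the first number of times to the third number of times; the sketch below assumes exactly that, and the function name `first_probability` is hypothetical. The counts 150 and 400 are the ones listed for "OsNetwork-L3link-BGPpeer" in Table 4.

```python
def first_probability(first_count: int, third_count: int) -> float:
    """Assumed form: share of end-node events that were caused via this path."""
    if third_count <= 0:
        raise ValueError("the end node must have had at least one abnormal event")
    return first_count / third_count


probability = first_probability(first_count=150, third_count=400)
print(probability)  # 0.375
```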
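203. The analysis device acquires a historical result, where the historical result includes a second result.
The analysis device acquires the historical fault propagation paths. Any fault propagation path among the historical fault propagation paths can be called a second fault propagation path; a second fault propagation path is a fault propagation path from before the first time period and indicates that a third abnormal event occurring at a third node causes a fourth abnormal event to occur at a fourth node.
Further, after acquiring the second fault propagation path, the analysis device may also acquire its fault propagation duration (hereinafter, the second duration), the second number of times that the second fault propagation path occurred before the first time period, and the fourth number of times that the fourth abnormal event occurred at the fourth node of the second fault propagation path before the first time period. That is, the second result may be the second fault propagation path alone; the second fault propagation path and the second duration; or the second fault propagation path, the second duration, the second number of times, and the fourth number of times.
The second result or the historical result in the embodiments of the present application is obtained by processing historical network data, using techniques such as a frequent subgraph mining algorithm, graph embedding, or clustering. It can be understood that the second result or the historical result may be the result of cumulative updates or the result of processing all the data, which is not limited here.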
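204. The analysis device determines whether the historical fault propagation paths include a target fault propagation path that is the same as the first fault propagation path. If so, step 205 is performed; if not, step 206 is performed.
The analysis device compares each second fault propagation path in the historical fault propagation paths with the first fault propagation path, one by one.
The analysis device may determine whether the third node and the fourth node of the second fault propagation path are the same as the first node and the second node of the first fault propagation path, respectively, and whether the third abnormal event at the third node and the fourth abnormal event at the fourth node are of the same event types as the first abnormal event at the first node and the second abnormal event at the second node, respectively. If both determinations are positive, the analysis device determines that the historical fault propagation paths include a target fault propagation path that is the same as the first fault propagation path. For example, in the second fault propagation path, an abnormal event at the third node at 5:30 causes an abnormal event at the fourth node at 6:00, and in the first fault propagation path, abnormal event 101 at the first node at 7:00 causes abnormal event 102 at the second node at 8:00. The first node and the third node are the same node, and the second node and the fourth node are the same node. The 5:30 event at the third node is of the same event type as abnormal event 101 (for example, "OSPF neighbor protocol state down within the network segment"), and the 6:00 event at the fourth node is of the same event type as abnormal event 102 (for example, "BGP Loopback interface IP unreachable"). In this case the historical fault propagation paths include a target fault propagation path that is the same as the first fault propagation path.
Further, the analysis device may also determine whether the nodes traversed by the second fault propagation path are the same as the nodes traversed by the first fault propagation path. If they are, it determines that the historical fault propagation paths include a target fault propagation path that is the same as the first fault propagation path.
The intermediate nodes traversed by a fault propagation path (nodes other than the start point and the end point) may or may not have abnormal events; this is not limited here.
If abnormal events occur at the nodes traversed by the second fault propagation path and the first fault propagation path, the analysis device may further determine whether the abnormal events at the nodes traversed by the second fault propagation path are of the same event types as those at the nodes traversed by the first fault propagation path. If they are, it determines that the historical fault propagation paths include a target fault propagation path that is the same as the first fault propagation path.
Further, the analysis device may also determine whether the order of the nodes traversed by the second fault propagation path is the same as the order of the nodes traversed by the first fault propagation path. If it is, it determines that the historical fault propagation paths include a target fault propagation path that is the same as the first fault propagation path.
In this embodiment, there are multiple ways to determine whether the historical fault propagation paths include a target fault propagation path that is the same as the first fault propagation path. The ways described above are only examples and are not limiting.
A second fault propagation path that is the same as the first fault propagation path is called the target fault propagation path.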
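The comparison criteria of step 204 can be sketched as follows. This is a minimal illustration under stated assumptions: the `FaultPath` type, the `same_path` function and the `check_intermediate` flag are hypothetical names introduced here, and an empty string stands for an intermediate node without an abnormal event.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FaultPath:
    nodes: Tuple[str, ...]        # ordered nodes: start point first, end point last
    event_types: Tuple[str, ...]  # event type per node ("" if no abnormal event)


def same_path(a: FaultPath, b: FaultPath, check_intermediate: bool = True) -> bool:
    """Return True if two fault propagation paths are considered the same."""
    # Basic criterion: same start/end nodes and same event types at both ends.
    if (a.nodes[0], a.nodes[-1]) != (b.nodes[0], b.nodes[-1]):
        return False
    if (a.event_types[0], a.event_types[-1]) != (b.event_types[0], b.event_types[-1]):
        return False
    if check_intermediate:
        # Stricter criterion: same traversed nodes, in the same order,
        # with the same event types along the way.
        return a.nodes == b.nodes and a.event_types == b.event_types
    return True


historical = FaultPath(("node1", "node2"),
                       ("OSPF neighbor down", "BGP loopback IP unreachable"))
current = FaultPath(("node1", "node2"),
                    ("OSPF neighbor down", "BGP loopback IP unreachable"))
print(same_path(historical, current))  # True
```

Occurrence times are deliberately left out of `FaultPath`: the example in the text matches a 5:30/6:00 path with a 7:00/8:00 path, so only nodes and event types take part in the comparison.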
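205. When the historical fault propagation paths include a target fault propagation path that is the same as the first fault propagation path, the second result is updated.
If the determination in step 204 is positive, that is, when the analysis device determines that the historical fault propagation paths include a target fault propagation path that is the same as the first fault propagation path, a record already exists from before the first time period, and the analysis device updates the second result.
For example, if the target fault propagation path in the historical fault propagation paths ("OsRouter-OsNetwork") occurred 150 times before the first time period, and the first fault propagation path (also "OsRouter-OsNetwork") occurred 10 times within the first time period, the analysis device updates the count of "OsRouter-OsNetwork" to 150 + 10 = 160.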
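The count update in this example amounts to adding the new occurrences to the stored total. A minimal sketch, assuming the historical counts are kept in a dictionary keyed by path name (the names `history_counts` and `update_count` are illustrative, not taken from the source):

```python
# Historical count of each fault propagation path before the first time period.
history_counts = {"OsRouter-OsNetwork": 150}


def update_count(history: dict, path: str, new_occurrences: int) -> int:
    """Add the occurrences seen in the first time period to the stored count."""
    history[path] += new_occurrences
    return history[path]


updated = update_count(history_counts, "OsRouter-OsNetwork", 10)
print(updated)  # 160
```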
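Optionally, when the historical fault propagation paths include a target fault propagation path that is the same as the first fault propagation path, the analysis device may calculate a target duration and update the second duration to the target duration. Taking "OsRouter-OsNetwork" as an example, the second result contains a second duration for this path and the first result contains a first duration for it.
For example, the first result includes the first fault propagation path, the first duration, the first count, and the third count, as shown in Table 2:
Table 2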
Figure PCTCN2020108424-appb-000004
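For example, the second result includes the target fault propagation path, the second duration, the second count, and the fourth count, as shown in Table 3:
Table 3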
Figure PCTCN2020108424-appb-000005
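The analysis device may calculate the target duration in several ways, described below with reference to the examples in Table 2 and Table 3:
1. The analysis device compares the second duration with the first duration and takes the larger of the two as the target duration. In this example, the analysis device determines 1 minute as the target duration. When the second duration is greater than the first duration, no update is needed, because the target duration equals the second duration, which is already recorded.
2. The analysis device may calculate the target duration as follows: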
Figure PCTCN2020108424-appb-000006
That is:
Figure PCTCN2020108424-appb-000007
It can be understood that, besides the two ways described above, the target duration may be calculated in other ways; this is not limited here.
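The two ways of combining durations can be sketched as below. The first function follows the "larger of the two" rule stated in the text. The second is an assumption: the formula image is not reproduced here, and a count-weighted average is only one plausible reading, suggested by the fact that the second way relies on the first count and the second count. All numeric values except the counts 150 and 10 are hypothetical.

```python
def target_duration_max(first_duration: float, second_duration: float) -> float:
    """Way 1: keep the larger of the two durations."""
    return max(first_duration, second_duration)


def target_duration_weighted(first_duration: float, first_count: int,
                             second_duration: float, second_count: int) -> float:
    """Way 2 (assumed form): average of the durations weighted by occurrence counts."""
    total = first_count + second_count
    return (first_duration * first_count + second_duration * second_count) / total


# Durations in seconds; 40 s and 60 s are hypothetical example values.
print(target_duration_max(40, 60))                # 60
print(target_duration_weighted(40, 10, 60, 150))  # 58.75
```

The maximum is a conservative bound on how long a fault may take to propagate, whereas a weighted average tracks the typical propagation time; which one suits a deployment depends on how the duration is used in later root-cause inference.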
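Optionally, when the historical fault propagation paths include a target fault propagation path that is the same as the first fault propagation path, the third abnormal event at the first node in the target fault propagation path is of the same event type as the first abnormal event at the first node in the first fault propagation path, and the fourth abnormal event at the second node in the target fault propagation path is of the same event type as the second abnormal event at the second node in the first fault propagation path. The analysis device may also calculate a target probability and update the probability of the target fault propagation path to the target probability.
Taking the data in Table 2 and Table 3 as an example, the analysis device may calculate the target probability as follows.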
Figure PCTCN2020108424-appb-000008
That is:
Figure PCTCN2020108424-appb-000009
It can be understood that, besides the way described above, the target probability may be calculated in other ways; this is not limited here.
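The formula images for the target probability are not reproduced in this text. Given that the first and second results carry the first to fourth counts, one plausible form (an assumption, not a transcription of the figures) pools the counts from the first time period with the historical counts. Here $P_t$ is the target probability, $N_1$ and $N_2$ are the first and second counts (occurrences of the path), and $N_3$ and $N_4$ are the third and fourth counts (occurrences of the end-point event type); the symbols are notation introduced here.

```latex
P_t = \frac{N_1 + N_2}{N_3 + N_4}
```

Under this reading, the target probability is the fraction of all end-point events, old and new, that are explained by the path, so it can be maintained from four stored counts without revisiting the historical network data.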
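206. When the historical fault propagation paths do not include a target fault propagation path that is the same as the first fault propagation path, the first result is saved.
If the determination in step 204 is negative, that is, when the analysis device determines that the historical fault propagation paths do not include a target fault propagation path that is the same as the first fault propagation path, no record exists from before the first time period, and the analysis device saves the first result.
For example, Figure 5 shows the first fault propagation path, "OsNetwork-L3link-BGPpeer". Because the second fault propagation paths from before the first time period contain no record of the first fault propagation path, the analysis device saves the first fault propagation path, that is, adds it to the records.
Optionally, when the historical fault propagation paths do not include a target fault propagation path that is the same as the first fault propagation path, the fault propagation duration corresponding to the first fault propagation path (the first duration) has no record from before the first time period, and the analysis device may save the first duration.
For example, the first result includes the first fault propagation path, the first duration, the first count, and the third count, as shown in Table 4:
Table 4
First fault propagation path    First count    Third count    First duration
OsNetwork-L3link-BGPpeer        150            400            1 min
That is, the analysis device saves a fault propagation duration of 1 min for the fault propagation path "OsNetwork-L3link-BGPpeer".
Optionally, when the historical fault propagation paths do not include a target fault propagation path that is the same as the first fault propagation path, the probability corresponding to the first fault propagation path (the first probability) has no record from before the first time period, and the analysis device may calculate or save the first probability.
If the first probability was calculated in step 202, the analysis device may save it directly. If it was not, the analysis device may calculate it in the manner described in step 202; details are not repeated here.
In this embodiment, step 203 may be performed before step 202 or before step 201, as long as it is performed before step 204. Step 206 may be performed before step 205, as long as it is performed after step 204.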
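If step 203 is performed after step 202, the analysis device may, in step 203, obtain only the historical results that correspond to the start point of the first fault propagation path. This reduces unnecessary data transmission and improves the efficiency of the subsequent comparison.
The fault propagation paths in this embodiment may be applied to fault localization. Take the fault propagation path shown in Figure 5, "OsNetwork-L3link-BGPpeer", as an example. Suppose an abnormal event occurs at the BGPpeer node. According to the fault propagation path shown in Figure 5, the L3link node connected to BGPpeer is located, it is then queried whether the L3link node is connected to an OsNetwork entity node, and the OsNetwork entity node is checked for alarms. If an alarm exists, the OsNetwork entity node is identified as the root cause of the fault. That is, the BGP neighbor disconnection is found to be caused by the OSPF neighbor protocol state going down within the network segment.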
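The backward walk described above can be sketched as follows. This is an illustrative simplification, not the method of the application: node types stand in for concrete node instances, and the function name `locate_root_cause` and the `links` and `alarms` data structures are assumptions made for the example.

```python
from typing import FrozenSet, List, Optional, Set


def locate_root_cause(path: List[str],
                      links: Set[FrozenSet[str]],
                      alarms: Set[str]) -> Optional[str]:
    """Walk a known fault propagation path backward from its end node.

    Returns the start node if every hop exists in the topology and the
    start node currently has an alarm, otherwise None.
    """
    current = path[-1]
    for upstream in reversed(path[:-1]):
        # Each hop of the recorded path must exist as a connection.
        if frozenset((upstream, current)) not in links:
            return None
        current = upstream
    return current if current in alarms else None


path = ["OsNetwork", "L3link", "BGPpeer"]
links = {frozenset(("OsNetwork", "L3link")), frozenset(("L3link", "BGPpeer"))}
alarms = {"OsNetwork"}
print(locate_root_cause(path, links, alarms))  # OsNetwork
```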
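The fault propagation path may also be used to predict which network nodes will be affected and to delimit the propagation range of a network fault. The fault propagation path, probability, and propagation duration may also be used to improve the efficiency and accuracy of subsequent root-cause inference.
In this embodiment, the analysis device processes the first network data to obtain the first result related to the first fault propagation path, and then obtains the historical result related to the historical fault propagation paths. If the historical fault propagation paths include a target fault propagation path that is the same as the first fault propagation path, the count of the target fault propagation path is updated. If they do not, the first fault propagation path is saved. This avoids having to process all obtained historical network data together with the new network data to derive new fault propagation paths every time a fault is located.
This embodiment may process only the latest network data and incrementally update the historical fault propagation paths. On the one hand, this lowers storage cost, because only the historical results related to the historical fault propagation paths are kept rather than all historical network data. On the other hand, it improves fault-localization efficiency, because processing only the new network data is more efficient than processing all historical network data plus the new network data.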
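The incremental update of steps 204 to 206 can be sketched end to end as below. This is a sketch under stated assumptions rather than the implementation of the application: `PathStats` and `merge_result` are hypothetical names, paths are matched by name only, the duration is merged with the "larger of the two" rule, and the probability is assumed to be the ratio of path count to end-point event count. The values 300, 20, 40.0 and 60.0 for "OsRouter-OsNetwork" are made up; the other numbers come from the examples in the text.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class PathStats:
    count: int            # times the path occurred
    end_event_count: int  # times the end-point event type occurred
    duration_s: float     # fault propagation duration in seconds

    @property
    def probability(self) -> float:
        # Assumed form: share of end-point events explained by this path.
        return self.count / self.end_event_count


def merge_result(history: Dict[str, PathStats], path: str, new: PathStats) -> str:
    """Merge the result of the latest time period into the stored history."""
    old = history.get(path)
    if old is None:
        history[path] = new  # step 206: no earlier record, save as-is
        return "saved"
    # Step 205: a record exists, update it in place.
    old.count += new.count
    old.end_event_count += new.end_event_count
    old.duration_s = max(old.duration_s, new.duration_s)
    return "updated"


history = {"OsRouter-OsNetwork": PathStats(150, 300, 60.0)}
print(merge_result(history, "OsRouter-OsNetwork", PathStats(10, 20, 40.0)))         # updated
print(merge_result(history, "OsNetwork-L3link-BGPpeer", PathStats(150, 400, 60.0)))  # saved
```

Only the per-path summaries are retained between runs, which is the storage saving the paragraph above describes: the raw network data of earlier periods is never read again.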
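Based on the foregoing embodiment, this embodiment provides the interaction flow between the analysis device and the collection device and between the analysis device and the cloud device. Referring to Figure 6, another embodiment of the data processing method in the embodiments of this application includes:
601. The collection device collects first network data.
When a node in the communication network fails, the collection device collects information on the abnormal events related to the fault and the connection relationships among the nodes in the network.
The abnormal-event information may include at least one of fault information, alarms, KPIs, and logs.
For example, different kinds of faults occur in a data center network, such as device restarts and Routerid conflicts. Each kind of fault produces many alarms and log entries. A Routerid conflict, for instance, produces alarms and logs such as OSPF neighbor state changes and BGP state machine state changes.
In this embodiment, the connection relationships among the nodes in the network may be created by the collection device or by another network device.
The creation process may be as follows. First, according to an expert-defined structural framework (schema), information on the relevant network entity objects is extracted from the abnormal-event information. The information is then parsed into network objects represented as structured JSON data. The JSON objects are entered into a graph database, each represented as a node, and connections between objects are established according to the attribute relationships between them. This is only one of several ways to create the connection relationships; in practice, they may also be created in other ways, for example with other structured data formats.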
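The creation process can be illustrated with a small sketch that turns JSON network objects into nodes and connections. The JSON layout (the `id`, `type` and `links` fields), the instance names and the `build_graph` function are all hypothetical, and an in-memory adjacency map stands in for the graph database mentioned in the text.

```python
import json
from collections import defaultdict

# Hypothetical JSON objects extracted from abnormal-event information.
raw_objects = """
[
  {"id": "OsNetwork-1", "type": "OsNetwork", "links": ["L3link-1"]},
  {"id": "L3link-1", "type": "L3link", "links": ["BGPpeer-1"]},
  {"id": "BGPpeer-1", "type": "BGPpeer", "links": []}
]
"""


def build_graph(json_text: str):
    """Return (nodes, edges): node id -> type, and an undirected adjacency map."""
    objects = json.loads(json_text)
    nodes = {obj["id"]: obj["type"] for obj in objects}
    edges = defaultdict(set)
    for obj in objects:
        for peer in obj.get("links", []):
            # Record the connection in both directions.
            edges[obj["id"]].add(peer)
            edges[peer].add(obj["id"])
    return nodes, edges


nodes, edges = build_graph(raw_objects)
print(sorted(edges["L3link-1"]))  # ['BGPpeer-1', 'OsNetwork-1']
```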
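602. The collection device sends the first network data to the analysis device.
After collecting the first network data, the collection device sends it to the analysis device.
603. The analysis device processes the first network data to obtain a first result.
604. The analysis device obtains a historical result, where the historical result includes a second result.
Steps 603 and 604 in this embodiment are similar to steps 202 and 203 in the embodiment shown in Figure 2 and are not described again here.
605. When the historical fault propagation paths include a target fault propagation path that is the same as the first fault propagation path, the analysis device updates the second result.
Step 605 in this embodiment is similar to step 205 in the embodiment shown in Figure 2 and is not described again here.
606. The analysis device sends the updated second result to the cloud device.
After updating the second result, the analysis device may send the updated second result to the cloud device. After receiving it, the cloud device may present a summary of the updated fault propagation paths and related information.
Optionally, the cloud device is integrated in an operations support system (OSS).
607. When the historical fault propagation paths do not include a target fault propagation path that is the same as the first fault propagation path, the analysis device saves the first result.
Step 607 in this embodiment is similar to step 206 in the embodiment shown in Figure 2 and is not described again here.
608. The analysis device sends the first result to the cloud device.
After saving the first result, the analysis device may send the first result to the cloud device. After receiving it, the cloud device may present a summary of the new fault propagation paths and related information.
In this embodiment, step 606 may be performed after step 607 or after step 608, as long as it is performed after step 605. Step 607 may be performed after step 608 or before step 606; this is not limited here.
This embodiment implements the information exchange between the collection device and the analysis device and between the analysis device and the cloud device, and reduces the transmission of network data (all historical network data). The analysis device may process only the latest network data, incrementally update the historical fault propagation paths, and transmit them to the cloud platform, which may display them. On the one hand, this lowers storage cost, because only the historical results related to the historical fault propagation paths are kept rather than all historical network data. On the other hand, it improves fault-localization efficiency, because processing only the new network data is more efficient than processing all historical network data plus the new network data.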
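The data processing method in the embodiments of this application is described above. The analysis device in the embodiments of this application is described below. Referring to Figure 7, one embodiment of the analysis device includes:
An obtaining unit 701, configured to obtain first network data, where the first network data includes information on abnormal events of multiple nodes in a network within a first time period and the connection relationships among the multiple nodes.
A processing unit 702, configured to process the first network data to obtain a first fault propagation path, where the first fault propagation path indicates that, within the first time period, a first abnormal event at a first node caused a second abnormal event at a second node, and the first node and the second node are any two different nodes among the multiple nodes.
The obtaining unit 701 is further configured to obtain historical fault propagation paths.
A determining unit 703, configured to determine whether the historical fault propagation paths include a target fault propagation path that is the same as the first fault propagation path, where the target fault propagation path indicates that, before the first time period, a third abnormal event at the first node caused a fourth abnormal event at the second node, the third abnormal event is of the same event type as the first abnormal event, and the fourth abnormal event is of the same event type as the second abnormal event.
An updating unit 704, configured to update the count of the target fault propagation path when the historical fault propagation paths include the target fault propagation path.
In this embodiment, the operations performed by the units of the analysis device are similar to those described in the embodiment shown in Figure 2 and are not described again here.
In this embodiment, the obtaining unit 701 obtains the first network data and the historical fault propagation paths, and the processing unit 702 processes the first network data to obtain the first fault propagation path. When the historical fault propagation paths include a target fault propagation path that is the same as the first fault propagation path, the updating unit 704 updates the count of the target fault propagation path. This is more efficient than processing all network data. Compared with storing all historical network data, storing only the historical fault propagation paths lowers storage cost, and the approach is replicable and scalable.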
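Referring to Figure 8, another embodiment of the analysis device in the embodiments of this application includes:
An obtaining unit 801, configured to obtain first network data, where the first network data includes information on abnormal events of multiple nodes in a network within a first time period and the connection relationships among the multiple nodes.
A processing unit 802, configured to process the first network data to obtain a first fault propagation path, where the first fault propagation path indicates that, within the first time period, a first abnormal event at a first node caused a second abnormal event at a second node, and the first node and the second node are any two different nodes among the multiple nodes.
The obtaining unit 801 is further configured to obtain historical fault propagation paths.
A determining unit 803, configured to determine whether the historical fault propagation paths include a target fault propagation path that is the same as the first fault propagation path, where the target fault propagation path indicates that, before the first time period, a third abnormal event at the first node caused a fourth abnormal event at the second node, the third abnormal event is of the same event type as the first abnormal event, and the fourth abnormal event is of the same event type as the second abnormal event.
An updating unit 804, configured to update the count of the target fault propagation path when the historical fault propagation paths include the target fault propagation path.
The analysis device in this embodiment further includes:
A saving unit 805, configured to save the first fault propagation path when the historical fault propagation paths do not include the target fault propagation path.
A first calculating unit 806, configured to perform a calculation on the first result and the second result to obtain a target duration.
A second calculating unit 807, configured to perform a calculation on the first result and the second result to obtain a target probability.
In this embodiment, the operations performed by the units of the analysis device are similar to those described in the embodiment shown in Figure 2 and are not described again here.
The first calculating unit 806 and the second calculating unit 807 in this embodiment may also be the same calculating unit; this is not limited here.
In this embodiment, the processing unit 802 processes the first network data to obtain the first result related to the first fault propagation path, and the obtaining unit 801 then obtains the historical result related to the historical fault propagation paths. If the historical fault propagation paths include a target fault propagation path that is the same as the first fault propagation path, the updating unit 804 updates the count of the target fault propagation path. If they do not, the saving unit 805 saves the first fault propagation path. This avoids having to process all obtained historical network data together with the new network data to derive new fault propagation paths every time a fault is located. In addition, updating the fault propagation paths, durations, and probabilities provides new reference data for subsequent root-cause determination, which helps improve the efficiency and accuracy of troubleshooting.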
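The analysis device in the embodiments of this application is described below. Referring to Figure 9, another embodiment of the analysis device includes:
The analysis device 900 may include one or more processors 901 and a memory 905, where the memory 905 stores one or more applications or data.
The memory 905 may be volatile storage or persistent storage. The program stored in the memory 905 may include one or more modules, and each module may include a series of instruction operations for the analysis device. Further, the processor 901 may be configured to communicate with the memory 905 and execute, on the analysis device 900, the series of instruction operations in the memory 905.
The analysis device 900 may further include one or more power supplies 902, one or more wired or wireless network interfaces 903, one or more input/output interfaces 904, and/or one or more operating systems, such as Windows Server™, Mac OS X™, Unix™, Linux™, and FreeBSD™.
The processor 901 may perform the operations performed by the analysis device in the embodiments shown in Figure 2 and Figure 6; details are not repeated here.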
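In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division into units is merely a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of this application. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.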

Claims (28)

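  1. A data processing method, comprising:
    obtaining, by an analysis device, first network data, wherein the first network data comprises information on abnormal events of multiple nodes in a network within a first time period and connection relationships among the multiple nodes;
    processing, by the analysis device, the first network data to obtain a first fault propagation path, wherein the first fault propagation path indicates that, within the first time period, a first abnormal event at a first node caused a second abnormal event at a second node, and the first node and the second node are any two different nodes among the multiple nodes;
    obtaining, by the analysis device, historical fault propagation paths;
    determining, by the analysis device, whether the historical fault propagation paths comprise a target fault propagation path that is the same as the first fault propagation path, wherein the target fault propagation path indicates that, before the first time period, a third abnormal event at the first node caused a fourth abnormal event at the second node, the third abnormal event is of the same event type as the first abnormal event, and the fourth abnormal event is of the same event type as the second abnormal event; and
    when the historical fault propagation paths comprise the target fault propagation path, updating, by the analysis device, a count of the target fault propagation path.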
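  2. The data processing method according to claim 1, wherein the nodes traversed by the target fault propagation path are the same as the nodes traversed by the first fault propagation path.
  3. The data processing method according to claim 1 or 2, wherein the processing, by the analysis device, the first network data to obtain a first fault propagation path comprises:
    processing, by the analysis device, the first network data to obtain a first result, wherein the first result comprises the first fault propagation path and a first duration, and the first duration is obtained by processing a first time interval, within the first time period, between the occurrence of the first abnormal event at the first node in the first fault propagation path and the occurrence of the second abnormal event at the second node in the first fault propagation path;
    the obtaining, by the analysis device, historical fault propagation paths comprises:
    obtaining, by the analysis device, a historical result, wherein the historical result comprises a second result, the second result comprises the target fault propagation path and a second duration corresponding to the target fault propagation path, and the second duration is obtained by processing a second time interval, before the first time period, between the occurrence of the third abnormal event and the occurrence of the fourth abnormal event; and
    the method further comprises:
    performing, by the analysis device, a calculation on the first result and the second result to obtain a target duration; and
    updating, by the analysis device, the second duration to the target duration.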
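  4. The data processing method according to claim 3, wherein the performing, by the analysis device, a calculation on the first result and the second result to obtain a target duration comprises:
    using, by the analysis device, the larger of the first duration and the second duration as the target duration.
  5. The data processing method according to claim 3, wherein the first result further comprises a first count, the first count being the number of times the first fault propagation path occurs within the first time period; the second result further comprises a second count, the second count being the number of times the target fault propagation path occurred before the first time period;
    the performing, by the analysis device, a calculation on the first result and the second result to obtain a target duration comprises:
    calculating, by the analysis device, the target duration as follows: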
    Figure PCTCN2020108424-appb-100001
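  6. The data processing method according to any one of claims 3 to 5, wherein the first result further comprises a third count, the third count being the number of times, within the first time period, that an abnormal event of the same event type as the second abnormal event occurs at the second node; the second result further comprises a fourth count, the fourth count being the number of times, before the first time period, that an abnormal event of the same event type as the fourth abnormal event occurred at the second node;
    the method further comprises:
    performing, by the analysis device, a calculation on the first result and the second result to obtain a target probability; and
    updating, by the analysis device, the probability of the target fault propagation path to the target probability.
  7. The data processing method according to claim 1 or 2, wherein the processing, by the analysis device, the first network data to obtain a first fault propagation path comprises:
    processing, by the analysis device, the first network data to obtain a first result, wherein the first result comprises the first fault propagation path and a third count, the third count being the number of times, within the first time period, that an abnormal event of the same event type as the second abnormal event occurs at the second node;
    the obtaining, by the analysis device, historical fault propagation paths comprises:
    obtaining, by the analysis device, a historical result, wherein the historical result comprises a second result, the second result comprises the target fault propagation path and a fourth count, the fourth count being the number of times, before the first time period, that an abnormal event identical to the abnormal event of the second node occurred at the second node; and
    the method further comprises:
    performing, by the analysis device, a calculation on the first result and the second result to obtain a target probability; and
    updating, by the analysis device, the probability of the target fault propagation path to the target probability.
  8. The data processing method according to claim 6 or 7, wherein the performing, by the analysis device, a calculation on the first result and the second result to obtain a target probability comprises:
    calculating, by the analysis device, the target probability as follows: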
    Figure PCTCN2020108424-appb-100002
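  9. The data processing method according to claim 1, wherein after the obtaining, by the analysis device, historical fault propagation paths, the method further comprises:
    when the historical fault propagation paths do not comprise the target fault propagation path, saving, by the analysis device, the first fault propagation path.
  10. The data processing method according to claim 9, wherein the processing, by the analysis device, the first network data to obtain a first fault propagation path comprises:
    processing, by the analysis device, the first network data to obtain a first result, wherein the first result comprises the first fault propagation path and a first duration, and the first duration is obtained by processing a first time interval, within the first time period, between the occurrence of the alarm corresponding to the first fault at the first node in the first fault propagation path and the occurrence of the alarm corresponding to the second fault at the second node in the first fault propagation path; and
    the method further comprises:
    saving, by the analysis device, the first duration.
  11. The data processing method according to claim 9 or 10, wherein the first result comprises a first count and a third count, the first count being the number of times the first fault propagation path occurs within the first time period, and the third count being the number of times, within the first time period, that an abnormal event of the same event type as the second abnormal event occurs at the second node; the third count is the number of times, within the first time period, that an event similar to the second abnormal event occurs at the second node; and
    the method further comprises:
    saving, by the analysis device, a first probability of the first fault propagation path, wherein the first probability is the probability that the second abnormal event at the second node was caused by the first abnormal event at the first node.
  12. The data processing method according to any one of claims 1 to 11, wherein the processing, by the analysis device, the first network data to obtain a first fault propagation path comprises:
    processing, by the analysis device, the first network data based on a frequent subgraph mining algorithm to obtain the first fault propagation path.
  13. The data processing method according to any one of claims 3 to 5, wherein when the first fault propagation path has multiple first time intervals, the first duration is the maximum or the average of the multiple first time intervals.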
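  14. An analysis device, comprising:
    an obtaining unit, configured to obtain first network data, wherein the first network data comprises information on abnormal events of multiple nodes in a network within a first time period and connection relationships among the multiple nodes;
    a processing unit, configured to process the first network data to obtain a first fault propagation path, wherein the first fault propagation path indicates that, within the first time period, a first abnormal event at a first node caused a second abnormal event at a second node, and the first node and the second node are any two different nodes among the multiple nodes;
    the obtaining unit being further configured to obtain historical fault propagation paths;
    a determining unit, configured to determine whether the historical fault propagation paths comprise a target fault propagation path that is the same as the first fault propagation path, wherein the target fault propagation path indicates that, before the first time period, a third abnormal event at the first node caused a fourth abnormal event at the second node, the third abnormal event is of the same event type as the first abnormal event, and the fourth abnormal event is of the same event type as the second abnormal event; and
    an updating unit, configured to update a count of the target fault propagation path when the historical fault propagation paths comprise the target fault propagation path.
  15. The analysis device according to claim 14, wherein the nodes traversed by the target fault propagation path are the same as the nodes traversed by the first fault propagation path.
  16. The analysis device according to claim 14 or 15, wherein the processing unit is specifically configured to process the first network data to obtain a first result, the first result comprises the first fault propagation path and a first duration, and the first duration is obtained by processing a first time interval, within the first time period, between the occurrence of the first abnormal event at the first node in the first fault propagation path and the occurrence of the second abnormal event at the second node in the first fault propagation path;
    the obtaining unit is specifically configured to obtain a historical result, wherein the historical result comprises a second result, the second result comprises the target fault propagation path and a second duration corresponding to the target fault propagation path, and the second duration is obtained by processing a second time interval, before the first time period, between the occurrence of the third abnormal event and the occurrence of the fourth abnormal event;
    the analysis device further comprises:
    a first calculating unit, configured to perform a calculation on the first result and the second result to obtain a target duration; and
    the updating unit is further configured to update the second duration to the target duration.
  17. The analysis device according to claim 16, wherein the first calculating unit is specifically configured to use the larger of the first duration and the second duration as the target duration.
  18. The analysis device according to claim 16, wherein the first result further comprises a first count, the first count being the number of times the first fault propagation path occurs within the first time period; the second result further comprises a second count, the second count being the number of times the target fault propagation path occurred before the first time period;
    the first calculating unit is specifically configured to calculate the target duration as follows: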
    Figure PCTCN2020108424-appb-100003
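  19. The analysis device according to any one of claims 16 to 18, wherein the first result further comprises a third count, the third count being the number of times, within the first time period, that an abnormal event of the same event type as the second abnormal event occurs at the second node; the second result further comprises a fourth count, the fourth count being the number of times, before the first time period, that an abnormal event of the same event type as the fourth abnormal event occurred at the second node;
    the first calculating unit is further configured to perform a calculation on the first result and the second result to obtain a target probability; and
    the updating unit is further configured to update the probability of the target fault propagation path to the target probability.
  20. The analysis device according to claim 14 or 15, wherein the processing unit is specifically configured to process the first network data to obtain a first result, the first result comprises the first fault propagation path and a third count, the third count being the number of times, within the first time period, that an abnormal event of the same event type as the second abnormal event occurs at the second node;
    the obtaining unit is specifically configured to obtain a historical result, wherein the historical result comprises a second result, the second result comprises the target fault propagation path and a fourth count, the fourth count being the number of times, before the first time period, that an abnormal event identical to the abnormal event of the second node occurred at the second node;
    the analysis device further comprises:
    a second calculating unit, configured to perform a calculation on the first result and the second result to obtain a target probability; and
    the updating unit is further configured to update the probability of the target fault propagation path to the target probability.
  21. The analysis device according to claim 19 or 20, wherein the first calculating unit or the second calculating unit is specifically configured to calculate the target probability as follows: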
    Figure PCTCN2020108424-appb-100004
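  22. The analysis device according to claim 14, wherein the analysis device further comprises:
    a saving unit, configured to save the first fault propagation path when the historical fault propagation paths do not comprise the target fault propagation path.
  23. The analysis device according to claim 22, wherein the processing unit is specifically configured to process the first network data to obtain a first result, the first result comprises the first fault propagation path and a first duration, and the first duration is obtained by processing a first time interval, within the first time period, between the occurrence of the alarm corresponding to the first fault at the first node in the first fault propagation path and the occurrence of the alarm corresponding to the second fault at the second node in the first fault propagation path; and
    the saving unit is further configured to save the first duration.
  24. The analysis device according to claim 22 or 23, wherein the first result comprises a first count and a third count, the first count being the number of times the first fault propagation path occurs within the first time period, and the third count being the number of times, within the first time period, that an abnormal event of the same event type as the second abnormal event occurs at the second node; and
    the saving unit is further configured to save a first probability of the first fault propagation path, wherein the first probability is the probability that the second abnormal event at the second node was caused by the first abnormal event at the first node.
  25. The analysis device according to any one of claims 14 to 24, wherein the processing unit is specifically configured to process the first network data based on a frequent subgraph mining algorithm to obtain the first fault propagation path.
  26. The analysis device according to any one of claims 16 to 18, wherein when the first fault propagation path has multiple first time intervals, the first duration is the maximum or the average of the multiple first time intervals.
  27. An analysis device, comprising:
    a processor, a memory, a bus, and an input/output device;
    wherein the processor is connected to the memory and the input/output device;
    the bus connects the processor, the memory, and the input/output device; and
    the processor performs the method according to any one of claims 1 to 13.
  28. A computer storage medium, wherein the computer storage medium stores instructions that, when executed on a computer, cause the computer to perform the method according to any one of claims 1 to 13.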
PCT/CN2020/108424 2020-02-14 2020-08-11 Data processing method and related device WO2021159676A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP20919283.0A EP4084411A4 (en) 2020-02-14 2020-08-11 DATA PROCESSING METHOD AND ASSOCIATED DEVICE
US17/875,809 US20220376971A1 (en) 2020-02-14 2022-07-28 Data processing method and related device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010093223.2 2020-02-14
CN202010093223.2A CN113271216B (zh) 2020-02-14 2020-02-14 Data processing method and related device

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/875,809 Continuation US20220376971A1 (en) 2020-02-14 2022-07-28 Data processing method and related device

Publications (1)

Publication Number Publication Date
WO2021159676A1 true WO2021159676A1 (zh) 2021-08-19

Family

ID=77227267

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/108424 WO2021159676A1 (zh) 2020-02-14 2020-08-11 Data processing method and related device

Country Status (4)

Country Link
US (1) US20220376971A1 (zh)
EP (1) EP4084411A4 (zh)
CN (1) CN113271216B (zh)
WO (1) WO2021159676A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114363149B (zh) * 2021-12-23 2023-12-26 上海哔哩哔哩科技有限公司 故障处理方法及装置

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020019870A1 (en) * 2000-06-29 2002-02-14 International Business Machines Corporation Proactive on-line diagnostics in a manageable network
CN100456687C (zh) * 2003-09-29 2009-01-28 华为技术有限公司 网络故障实时相关性分析方法及系统
CN102640154A (zh) * 2009-07-30 2012-08-15 惠普开发有限公司 基于所接收的与网络实体相关联的事件来构造贝叶斯网络
CN110597726A (zh) * 2019-09-19 2019-12-20 中国商用飞机有限责任公司北京民用飞机技术研究中心 航电系统的安全性管理方法、装置、设备和存储介质

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001337143A (ja) * 2000-05-30 2001-12-07 Nec Corp 論理回路における故障箇所推定システム、及び、故障箇所推定方法、並びに、記録媒体
CN105187255B (zh) * 2015-09-29 2018-08-14 华为技术有限公司 故障分析方法、故障分析装置和服务器
CN108322320B (zh) * 2017-01-18 2020-04-28 华为技术有限公司 业务生存性分析方法及装置
US20190286504A1 (en) * 2018-03-15 2019-09-19 Ca, Inc. Graph-based root cause analysis
CN109861858B (zh) * 2019-01-28 2020-06-26 北京大学 微服务系统根因节点的错误排查方法
CN110752952B (zh) * 2019-10-25 2022-02-22 腾讯科技(深圳)有限公司 网络故障定位方法、装置、网络设备及计算机存储介质

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4084411A4

Also Published As

Publication number Publication date
EP4084411A4 (en) 2023-07-19
CN113271216A (zh) 2021-08-17
EP4084411A1 (en) 2022-11-02
US20220376971A1 (en) 2022-11-24
CN113271216B (zh) 2022-05-17

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20919283

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2020919283

Country of ref document: EP

Effective date: 20220729

NENP Non-entry into the national phase

Ref country code: DE