WO2023030183A1 - Network fault display method and device - Google Patents

Network fault display method and device

Info

Publication number
WO2023030183A1
Authority
WO
WIPO (PCT)
Prior art keywords
event
information
root cause
result
fault
Application number
PCT/CN2022/115069
Inventor
肖欣
谢于明
王晨敏
王俊
张忠刚
Original Assignee
华为技术有限公司
Application filed by 华为技术有限公司
Publication of WO2023030183A1

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06: Management of faults, events, alarms or notifications
    • H04L41/0631: Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06: Management of faults, events, alarms or notifications
    • H04L41/0677: Localisation of faults
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04M: TELEPHONIC COMMUNICATION
    • H04M3/00: Automatic or semi-automatic exchanges
    • H04M3/22: Arrangements for supervision, monitoring or testing

Definitions

  • the present application relates to the field of network technology, in particular to a network fault display method and device.
  • DCN: data center network
  • ARP: address resolution protocol
  • equipment restarts
  • RI: router identity
  • the front-end interface of the network fault operation and maintenance system often only presents various alarms and topology paths related to the fault. Users need to be familiar with the operation manual of the operation and maintenance tool and the alarm meanings to know the specific faults and causes in the network.
  • This application provides a network fault display method and device that display the root cause event information and the influence range information of the root cause event in text form, so that users can intuitively see that an abnormal event at a certain device in the network has caused a network fault and other alarm events.
  • in a first aspect, a network fault display method is provided. The method includes: acquiring information of a plurality of alarm events in the network, determining a fault of the network according to the information of the plurality of alarm events, and displaying description information of the fault in text form.
  • the description information of the fault includes location information of the root cause event, description information of the root cause event, and influence range information of the root cause event.
  • a root cause event is an alarm event among multiple alarm events. Other alarm events in the plurality of alarm events are result events of the root cause event.
  • the influence range information of the root cause event includes information of at least one result event in the result events.
  • a single fault may cause multiple alarm events to be generated by multiple devices.
  • the scheme distinguishes the root cause event and the result event among multiple alarm events, and displays the location information and description information of the root cause event that causes the fault, as well as the scope of influence of the root cause event in text form. Therefore, users do not need to be familiar with the operation manual of the operation and maintenance tool and the meaning of various alarm events, but can intuitively see that an abnormal event at a certain device has caused a network failure and other alarm events. This solution reduces the difficulty of fault O&M and improves the efficiency of fault O&M.
  • the information of the at least one result event includes location information of the at least one result event and description information of the at least one result event.
  • displaying the description information of the fault in text form includes displaying the description information of the fault in natural language.
  • in one implementation, the location information of the root cause event and the description information of the root cause event are connected by a verb, and the description information of the root cause event and the influence range information of the root cause event are connected by a verb.
  • in another implementation, the description information of the root cause event and the location information of the root cause event are connected by a preposition, and the description information of the root cause event and the influence range information of the root cause event are connected by a verb.
  • for example, the description information of the fault may be "a occurred in A, resulting in X" or "a in A, causing X", where A is the location information of the root cause event, a is the description information of the root cause event, and X is the influence range information of the root cause event.
  • the location information of the root cause event indicates the location where the root cause event occurs, and may include attribute information of the entity to which the root cause event belongs.
  • the attribute information of the entity to which the root cause event belongs may be one or more of the name, type, identity (identity, ID) and Internet Protocol (Internet Protocol, IP) address of the entity to which the root cause event belongs.
  • the description information of the root cause event includes the identification of the root cause event and/or the meaning of the root cause event.
  • the identification of the root cause event may be the name of the root cause event or the ID of the root cause event.
  • the meaning of the root cause event may be the semantic description information of the root cause event.
  • the description information of the fault further includes information of at least one cluster group.
  • Each cluster group of the at least one cluster group includes one or more of the result events.
  • Clustering the result events to display each result event by category can allow the user to know more clearly which types of alarm events are caused by the root cause event.
  • natural language is used to display information of at least one cluster group.
  • the location information of the result event included in each cluster group in at least one cluster group, and the description information of the result event included in each cluster group in the at least one cluster group are connected by verbs.
  • the result events are clustered according to classification conditions to obtain at least one cluster group.
  • the classification condition includes one or more of the following: the type of the result event, the meaning of the result event, the entity to which the result event belongs, the topological relationship between the entities to which the result event belongs, and the level corresponding to the result event.
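  • As a rough illustration of clustering result events by a single classification condition, the sketch below groups events by type. The `ResultEvent` record, its field names, the `cluster_by` helper, and the sample events are all hypothetical and are not taken from the application.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ResultEvent:
    # Hypothetical record; field names are illustrative only.
    name: str        # identification of the result event
    event_type: str  # classification condition used below
    entity: str      # entity to which the result event belongs


def cluster_by(events, key):
    """Group result events into cluster groups using a key function."""
    groups = defaultdict(list)
    for event in events:
        groups[key(event)].append(event)
    return dict(groups)


# Sample events are made up for illustration.
events = [
    ResultEvent("OSPFPeerDisConnect", "protocol", "leaf-1"),
    ResultEvent("bgpBackwardTransition_active", "protocol", "spine-1"),
    ResultEvent("serviceDisconnected", "service", "server-1"),
]

clusters = cluster_by(events, key=lambda e: e.event_type)
```

  • Passing a different key function (for example `lambda e: e.entity`) would cluster by another of the listed conditions without changing the helper.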
  • determining the fault of the network according to the information of the multiple alarm events includes: determining the root cause event according to the information of the multiple alarm events and a fault location evidence chain.
  • the fault location evidence chain indicates a causal relationship between multiple alarm events.
  • the method further includes sending the description information of the fault.
  • Other devices, such as user equipment, may receive the description information of the fault and display it in text form.
  • in a second aspect, a network fault display device is provided. The device includes an acquisition module, a determination module and a display module.
  • the acquisition module is used for obtaining information of multiple alarm events in the network.
  • the determination module is configured to determine a fault of the network according to the information of the plurality of alarm events.
  • the display module is used to display the description information of the fault in text form.
  • the description information of the fault includes location information of the root cause event, description information of the root cause event, and influence range information of the root cause event.
  • the root cause event is an alarm event among the plurality of alarm events. Other alarm events in the plurality of alarm events are result events of the root cause event.
  • the influence range information of the root cause event includes information of at least one result event in the result events.
  • the information of the at least one result event includes location information of the at least one result event and description information of the at least one result event.
  • the display module is configured to use natural language to display fault description information.
  • the location information of the root cause event and the description information of the root cause event are connected by a verb, and the description information of the root cause event and the influence range information of the root cause event are connected by a verb.
  • the description information of the root cause event and the location information of the root cause event are connected through a preposition, and the description information of the root cause event and the influence range information of the root cause event are connected through a verb.
  • the description information of the root cause event includes the identification of the root cause event and/or the meaning of the root cause event.
  • the description information of the fault further includes information of at least one cluster group.
  • Each cluster group of the at least one cluster group includes one or more of the result events.
  • the display module is further configured to use natural language to display information of at least one cluster group.
  • the location information of the result event included in each cluster group in at least one cluster group, and the description information of the result event included in each cluster group in the at least one cluster group are connected by verbs.
  • the display module is configured to cluster the result events based on the classification condition to obtain at least one cluster group.
  • the classification condition includes one or more of the following: the type of the result event, the meaning of the result event, the entity to which the result event belongs, the topological relationship between the entities to which the result event belongs, and the level corresponding to the result event.
  • the determination module determining the fault of the network according to the information of the plurality of alarm events includes: the determination module determines the root cause event according to the information of the multiple alarm events and the fault location evidence chain. The fault location evidence chain indicates a causal relationship between the multiple alarm events.
  • the network fault display device further includes a sending module.
  • the sending module is used for sending description information of the fault.
  • in a third aspect, a network fault display device is provided. The device includes a processor and a memory.
  • the processor is configured to execute the computer program stored in the memory to implement the network fault display method provided by the foregoing first aspect or any possible implementation manner of the first aspect.
  • a computer-readable storage medium is provided. Instructions are stored in the computer-readable storage medium. When the instructions are run on a computer, the computer executes the network fault display method provided by the first aspect or any possible implementation of the first aspect.
  • a computer program product including instructions is provided. When the instructions are run on a computer, the computer executes the network fault display method provided in the first aspect or any possible implementation manner of the first aspect.
  • FIG. 1 is a schematic diagram of a data center network.
  • FIG. 2 is a flow chart of a network fault display method provided in an embodiment of the present application.
  • FIG. 3a is a schematic diagram of the relationship in a log of RI conflicts in a network device provided by an embodiment of the present application.
  • FIG. 3b is a schematic diagram of entity information in a log of RI conflicts in a network device provided by an embodiment of the present application.
  • FIG. 3c is a schematic diagram of alarm information in a log of RI conflicts in a network device provided by an embodiment of the present application.
  • FIG. 4a is a schematic diagram of fault description information provided by an embodiment of the present application.
  • FIG. 4b is a schematic diagram of another fault description information provided by an embodiment of the present application.
  • FIG. 4c is a schematic diagram of another fault description information provided by an embodiment of the present application.
  • FIG. 4d is a schematic diagram of another fault description information provided by an embodiment of the present application.
  • FIG. 5 is a flow chart of another network fault display method provided by an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a network fault display device provided in an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of another network fault display device provided by an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of another network fault display device provided by an embodiment of the present application.
  • the term "and/or" only describes an association relationship between associated objects and indicates that three relationships may exist. For example, A and/or B may indicate three cases: A exists alone, both A and B exist, and B exists alone.
  • the term "plurality" means two or more. For example, multiple systems refer to two or more systems, and multiple screen terminals refer to two or more screen terminals.
  • the terms "first" and "second" are used for descriptive purposes only, and cannot be understood as indicating or implying relative importance or implicitly specifying the number of indicated technical features. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of these features.
  • the terms “including”, “comprising”, “having” and variations thereof mean “including but not limited to”, unless specifically stated otherwise.
  • FIG. 1 is a schematic diagram of a topology structure of a data center network.
  • the data center network adopts a leaf-spine network architecture, which specifically includes: at least one spine (spine) switch, at least one leaf (leaf) switch, and at least one server (server).
  • as shown in FIG. 1, the at least one spine switch includes spine switch 101 and spine switch 102, the at least one leaf switch includes leaf switches 111-114, and the at least one server includes servers 121-124.
  • the spine switch may be an aggregation switch, and the leaf switch may be an access switch. The spine switches and the leaf switches are fully connected.
  • a leaf switch connects at least one server.
  • the server may provide various types of services, for example, web page service, video service and storage service.
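  • To make the leaf-spine structure concrete, the following sketch enumerates the links implied by a full connection between the spine and leaf switches of FIG. 1. The device numbers come from the text; the pairing of each server with one leaf switch is an assumption made only for illustration, since the text states only that a leaf switch connects at least one server.

```python
from itertools import product

# Device numbers follow FIG. 1 as described in the text.
spines = [101, 102]
leaves = [111, 112, 113, 114]
servers = [121, 122, 123, 124]

# Full connection: every leaf switch links to every spine switch.
fabric_links = list(product(leaves, spines))

# Assumption: server 12x attaches to leaf 11x. The text does not
# specify which server attaches to which leaf switch.
access_links = list(zip(servers, leaves))
```

  • With two spine switches and four leaf switches, the full connection yields eight leaf-to-spine links, so a single failed link still leaves each leaf switch a path through the other spine switch.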
  • the data center network shown in FIG. 1 is only an application scenario diagram of the embodiment of the present application.
  • the embodiments of the present application may also be applied to other networks, for example, enterprise campus networks, campus networks, and operator networks.
  • the operator's network may be an operator's access network, an operator's transmission network, or an operator's end-to-end network including an access network and a transmission network.
  • when a fault occurs, multiple devices on the network may generate multiple alarms. For example, if the interface connecting the leaf switch 111 in FIG. 1 to the spine switch 101 is disconnected, the leaf switch 111 generates alarm events such as interface disconnection and protocol exception, the spine switch 101 also generates alarm events such as interface disconnection and protocol exception, and the server 121 generates alarm events such as service disconnection and transmission control protocol disconnection. If the fault operation and maintenance interface displays only these alarm events, operation and maintenance personnel need to click on each alarm event one by one and learn its specific meaning to infer the correlation between the alarm events and thereby find the specific cause of the fault. This increases the difficulty of fault O&M and reduces the efficiency of handling network faults.
  • an embodiment of the present application provides a network fault display method.
  • the network fault display device acquires information of multiple alarm events in the network, determines the fault of the network according to the information of the multiple alarm events, and displays the description information of the fault in text form.
  • the description information of the fault includes location information of the root cause event, description information of the root cause event, and influence range information of the root cause event.
  • the root cause event is an alarm event among multiple alarm events. Other alarm events in the plurality of alarm events are result events of the root cause event.
  • the influence range information of the root cause event includes information of at least one result event in the result events. That is, by viewing the fault description information, the user can clearly understand: which abnormal event occurred at which location caused the network fault and caused other alarm events to be generated. Please refer to the following description for the detailed solutions of the embodiments of the present application.
  • Fig. 2 is a flow chart of a network fault display method provided by an embodiment of the present application. The method is applied to network fault display equipment. As shown in FIG. 2, the method includes the following steps S201-S203.
  • Step S201: acquiring information on multiple alarm events in the network.
  • Alarm events can be various types of events.
  • the alarm event may be an abnormal performance event (for example, the delay of a service exceeds a threshold, or the packet loss rate of a network exceeds a threshold).
  • the alarm event may be an abnormal state event of the device (for example, an interface of the device is disconnected, or a protocol running on the device is in an error state).
  • the alarm event may be an abnormal event recorded in a device log.
  • the network fault display device can obtain information about multiple alarm events from each network device.
  • the network fault display device may send a collection instruction to each network device to obtain the alarm event information, or a log that stores the alarm event information, from that device.
  • the network device may also send the alarm event information, or a log that stores the alarm event information, to the network fault display device at a set period, as in step S200 shown in FIG. 2.
  • the network device may also send the information of the alarm event to the network fault display device when detecting the alarm event.
  • Network devices may be various types of devices, for example, servers and switches in the data center network shown in FIG. 1 .
  • the information of the alarm event may include: one or more of connection relationship between entities, association relationship between the alarm event and the entity to which the alarm event belongs, attribute information of the entity to which the alarm event belongs, and attribute information of the alarm event.
  • the entity may be a network device, an interface of the network device, or a protocol module in the network device.
  • an entity-to-entity connection relationship represents a direct connection between two entities.
  • for example, if entity 1 and entity 2 are network devices, the connection relationship between entity 1 and entity 2 indicates that entity 1 and entity 2 are directly connected. If entity 3 is an interface, the connection relationship between entity 1 and entity 3 indicates that entity 3 is an interface of entity 1.
  • the connection relationship between entities can be expressed by associating the identifiers of the two entities.
  • the identification of the entity may be the ID of the entity.
  • in FIG. 3a, the first group of "srcResId" and "tarResId" in the "edges" class indicates an entity-to-entity connection relationship.
  • srcResId indicates the ID of the source entity, specifically "83b97583-b7a9-4e40-a149-de1c4ec9674f", and tarResId indicates the ID of the target entity, specifically "f31d3ce9-4fa0-3276-914f-b19821e00b7a".
  • the entity with ID f31d3ce9-4fa0-3276-914f-b19821e00b7a is an interface, and the entity with ID 83b97583-b7a9-4e40-a149-de1c4ec9674f is a network device. That is, the first group of information indicates that the interface with the ID f31d3ce9-4fa0-3276-914f-b19821e00b7a is an interface of the network device with the ID 83b97583-b7a9-4e40-a149-de1c4ec9674f.
  • the association relationship between the alarm event and the entity to which the alarm event belongs indicates that the alarm event has occurred in the entity.
  • the association relationship between the alarm event and the entity to which the alarm event belongs may be indicated by associating the identifier of the alarm event with the identifier of the entity to which the alarm event belongs.
  • the identifier of the alarm event may be an ID of the alarm event.
  • the second group of "srcResId" and "tarResId" in the "edges" class indicates the association relationship between the alarm event and the entity to which the alarm event belongs.
  • srcResId indicates the ID of the alarm event, specifically "346705413275670", and tarResId indicates the ID of the entity to which the alarm event belongs, specifically "f31d3ce9-4fa0-3276-914f-b19821e00b7a". That is, the second group of information indicates that an alarm event with the ID 346705413275670 has occurred on the entity with the ID f31d3ce9-4fa0-3276-914f-b19821e00b7a.
  • the attribute information of the entity may include one or more of the type, name, IP address and ID of the entity.
  • for example, the entity with ID f31d3ce9-4fa0-3276-914f-b19821e00b7a has the type "Interface", the name "10GE3/0/10", and the IP address "192.168.7.2".
  • from the type and name of the entity, it can be determined that the entity is an interface of a network device. It can be known from FIG. 3a that an alarm event with ID 346705413275670 occurred on this interface.
  • the entity with ID 83b97583-b7a9-4e40-a149-de1c4ec9674f has the type "NetworkElement", the name "DC1-spine-01", and the IP address "192.84.21.109". It can be known from FIG. 3a that this network device includes an interface with the ID f31d3ce9-4fa0-3276-914f-b19821e00b7a.
  • the attribute information of the alarm event may include: one or more of the name of the alarm event, semantic description information, ID, level, and location information.
  • the location information indicates the location where the alarm event occurs, and may include attribute information of the entity to which the alarm event belongs.
  • the attribute information of the entity to which the alarm event belongs may be the name, type, ID, and IP address of that entity. As shown in FIG. 3c, the name of the alarm event is "conflictrouteridintf", the level of the alarm event is "2", the ID of the alarm event is "346705413275670", and the semantic description information of the alarm event is "OSPF router ID conflict is detected on the interface".
  • the type of the entity to which the alarm event belongs is "Interface", the name of that entity is "DC1-spine-01", its ID is "f31d3ce9-4fa0-3276-914f-b19821e00b7a", and its IP address is "192.168.7.2".
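  • The relationship, entity, and alarm records described for FIG. 3a to FIG. 3c can be pictured with the sketch below. The keys `edges`, `srcResId` and `tarResId` and the ID and attribute values are quoted from the text; the layout of the `entities` and `alarms` dictionaries and both lookup helpers are assumptions made for illustration, not the log format used by the application.

```python
# IDs and attribute values are quoted from the description of FIG. 3a-3c.
DEVICE_ID = "83b97583-b7a9-4e40-a149-de1c4ec9674f"
INTERFACE_ID = "f31d3ce9-4fa0-3276-914f-b19821e00b7a"
ALARM_ID = "346705413275670"

edges = [
    {"srcResId": DEVICE_ID, "tarResId": INTERFACE_ID},  # device has interface
    {"srcResId": ALARM_ID, "tarResId": INTERFACE_ID},   # alarm occurred on interface
]

# Hypothetical layout for entity and alarm attributes.
entities = {
    DEVICE_ID: {"type": "NetworkElement", "name": "DC1-spine-01",
                "ip": "192.84.21.109"},
    INTERFACE_ID: {"type": "Interface", "name": "10GE3/0/10",
                   "ip": "192.168.7.2"},
}
alarms = {
    ALARM_ID: {"name": "conflictrouteridintf", "level": "2"},
}


def entity_of_alarm(alarm_id):
    """Return the attributes of the entity an alarm event belongs to."""
    for edge in edges:
        if edge["srcResId"] == alarm_id and edge["tarResId"] in entities:
            return entities[edge["tarResId"]]
    return None


def owner_of_interface(interface_id):
    """Return the attributes of the device that owns an interface."""
    for edge in edges:
        if edge["tarResId"] == interface_id and edge["srcResId"] in entities:
            return entities[edge["srcResId"]]
    return None
```

  • Following the two edges in turn resolves the alarm to its interface and then to the owning network device, which is the location information needed later for the fault description.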
  • Step S202: determining a network fault according to information of multiple alarm events.
  • the network fault display device may analyze and summarize the multiple alarm events based on the information of the multiple alarm events, so as to determine the fault of the network.
  • the network fault display device may use the fault location evidence chain to determine a root cause event among multiple alarm events, and a result event caused by the root cause event.
  • the fault location evidence chain indicates a causal relationship between multiple alarm events, and may include information about one or more result events and information about a root cause event that causes the one or more result events to occur.
  • the network fault display device can compare the obtained information of the alarm event with the information of each alarm event in the fault location evidence chain, so as to determine the root cause event and the result event in the multiple obtained alarm events.
  • the fault location evidence chain may be "information of event L1 -> information of event L2 -> information of event L3 -> information of event L4".
  • the event L1 is the root cause event in the fault location evidence chain
  • the events L2-L4 are the result events caused by the event L1.
  • the event L2 is also the causal event of the event L3
  • the event L3 is the causal event of the event L4.
  • the information of the events L1-L4 may include the information of the alarm events listed in the aforementioned step S201.
  • the network fault display device can compare the information of each acquired alarm event with the information of events L1 to L4 in the fault location evidence chain. When the information of the alarm event is consistent with the information of the event L1, the alarm event is the root cause event. When the information of the alarm event is consistent with the information of the event L2 (or L3, or L4), the alarm event is a result event.
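  • The comparison against the evidence chain can be sketched as follows. This is a simplified illustration that matches on a single string per event; the event labels and the `classify` helper are hypothetical, and the application does not prescribe how "consistent" is evaluated.

```python
# Hypothetical evidence chain: the first entry is the root cause event (L1),
# later entries are result events (L2-L4). Names are placeholders.
evidence_chain = ["event_L1", "event_L2", "event_L3", "event_L4"]


def classify(alarm_names, chain):
    """Split alarm events into a root cause event and result events.

    An alarm matching the head of the chain is the root cause event; an
    alarm matching any later entry is a result event. Alarms that match
    no entry are ignored.
    """
    root_cause = None
    results = []
    for name in alarm_names:
        if name == chain[0]:
            root_cause = name
        elif name in chain[1:]:
            results.append(name)
    return root_cause, results


root, results = classify(
    ["event_L3", "event_L1", "unrelated", "event_L2"], evidence_chain
)
```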
  • the fault location evidence chain may also include only one type of information about the root cause event and one type of information about the result events caused by the root cause event.
  • for example, the fault location evidence chain may include the name of the root cause event and the names of the result events caused by the root cause event. Taking the above RI conflict fault as an example, the fault location evidence chain represented by event names can be as follows:
  • the name OSPFRouterIDConflict of the first event indicates the name of the root cause event
  • the three names after OSPFRouterIDConflict all indicate the name of the result event caused by the root cause event.
  • the result event represented by the second event name OSPFPeerDisConnect is also the cause event of the result event represented by the third event name OSPFPeerDisConnect.
  • the fault location evidence chain may only include the name of the entity to which the root cause event belongs and the name of the entity to which the result event caused by the root cause event belongs.
  • the fault location evidence chain represented by entity names can include the following four items:
  • the first entity name represents the name of the entity to which the root cause event belongs
  • the entity name after the first entity name represents the name of the entity to which the result event belongs.
  • the fault location evidence chain represented by a single type of event information is not limited to the two kinds shown above, and may also be represented by other information of an alarm event.
  • two alarm events may have the same information in one category without being the same event, because their other information differs. Therefore, when event information is analyzed using a fault location evidence chain built from a single type of information, information from multiple dimensions corresponding to the fault location evidence chain can be combined to determine whether an alarm event is a root cause event or a result event.
  • the comparison may be made from the perspective of the IP address of the entity of the alarm event.
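  • A minimal sketch of combining two dimensions, event name and entity IP address, to tell apart alarm events that share a name. The event names appear in the text; the IP addresses, the `role` field, and the `match` helper are hypothetical.

```python
# Hypothetical chain entries keyed on two dimensions: the event name and
# the IP address of the entity to which the event belongs.
chain = [
    {"name": "OSPFRouterIDConflict", "ip": "10.0.0.1", "role": "root cause"},
    {"name": "OSPFPeerDisConnect", "ip": "10.0.0.1", "role": "result"},
    {"name": "OSPFPeerDisConnect", "ip": "10.0.0.2", "role": "result"},
]


def match(alarm, chain_entries):
    """Return the chain entry whose name and IP both equal the alarm's."""
    for entry in chain_entries:
        if entry["name"] == alarm["name"] and entry["ip"] == alarm["ip"]:
            return entry
    return None


alarm = {"name": "OSPFPeerDisConnect", "ip": "10.0.0.2"}
entry = match(alarm, chain)
```

  • Matching on the name alone would return the second entry for this alarm; adding the IP address selects the third entry, which is the point of comparing more than one dimension.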
  • Step S203: displaying description information of the fault in text form.
  • the description information of the fault includes: location information of the root cause event, description information of the root cause event, and influence range information of the root cause event.
  • the influence range information of the root cause event includes information of at least one result event in the result events.
  • the information of the at least one result event may include location information of the at least one result event and description information of the at least one result event.
  • the description information of an event (including a root cause event and a result event) may include an identifier of the event and/or a meaning of the event.
  • the event identifier can be the name or ID of the event.
  • the meaning of the event may be the semantic description information of the event.
  • the network fault display device can display fault description information in natural language.
  • the network fault display device may use connecting words to textually connect the location information of the root cause event, the description information of the root cause event, and the influence range information of the root cause event, so as to obtain the description information of the fault.
  • the connecting words can be verbs, prepositions, or other types of words.
  • the network fault display device may use the first verb to connect the location information of the root cause event and the description information of the root cause event to indicate that the root cause event occurred at the location.
  • the network fault display device may use the second verb to connect the description information of the root cause event and the influence range information of the root cause event to indicate the influence range of the root cause event.
  • the first verb can be "occur", "exist", "generate", "appear", etc.
  • the second verb can be "cause", "lead to", "result in", etc.
  • the network fault display device can also use the first verb to connect the location information of the result event and the description information of the result event in the influence range information of the root cause event.
  • for example, the description information of the fault can be "a occurred in A, which caused b to occur in B", or "a exists in A, which caused b to occur in B", or "a occurred in A, causing b to be generated in B", where A and B are location information and a and b are description information. It can be understood that the fault description information in this application is not limited to the verbs listed here.
  • the network fault display device may also use the first preposition to connect the description information of the root cause event and the location information of the root cause event, so as to indicate that the root cause event occurs at the location.
  • the network fault display device can use the second verb to connect the description information and the influence scope information of the root cause event.
  • The first preposition can be "in", etc.
  • the network fault display device may also use the first verb or the first preposition to connect the location information and description information of the result event in the influence range information of the root cause event.
  • The description information of the fault can be "a in A causes b in B", "a in A leads to b in B", or "a in A causes b to occur at B", etc.
  • the description information of the fault may specifically be "hwOspfv2IntraAreaRouteridConflict occurred on serverleaf02_1, causing bgpBackwardTransition_active to occur on DC1-spine-01".
  • the fault description can also be "hwOspfv2IntraAreaRouteridConflict in serverleaf02_1, causing bgpBackwardTransition_active to occur in DC1-spine-01".
  • serverleaf02_1 is the location information of the root cause event
  • hwOspfv2IntraAreaRouteridConflict is the description information of the root cause event
  • DC1-spine-01 is the location information of the result event caused by the root cause event
  • bgpBackwardTransition_active is the description information of the result event.
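  • The verb-based and preposition-based connections described above can be sketched as two small string templates. This is a minimal sketch, not the patented implementation: the function names and the default connecting words are assumptions chosen so that the output matches the two example sentences in the text.

```python
def describe_with_verbs(root_loc, root_desc, result_loc, result_desc,
                        first_verb="occurred on", second_verb="causing"):
    """Verb style: the first verb links description and location,
    the second verb links the root cause to its influence range."""
    return (f"{root_desc} {first_verb} {root_loc}, "
            f"{second_verb} {result_desc} to occur on {result_loc}")


def describe_with_preposition(root_loc, root_desc, result_loc, result_desc,
                              preposition="in", second_verb="causing"):
    """Preposition style: a preposition links description and location,
    a verb still links the root cause to its influence range."""
    return (f"{root_desc} {preposition} {root_loc}, "
            f"{second_verb} {result_desc} to occur {preposition} {result_loc}")


if __name__ == "__main__":
    args = ("serverleaf02_1", "hwOspfv2IntraAreaRouteridConflict",
            "DC1-spine-01", "bgpBackwardTransition_active")
    print(describe_with_verbs(*args))
    print(describe_with_preposition(*args))
```

  • Keeping the connecting words as parameters reflects the statement that the description is not limited to the listed verbs.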
  • When the root cause event causes multiple result events, the influence range information of the root cause event may include information of the multiple result events.
  • the description information of the fault may be "a occurs in A, causing b to occur in B, and c to occur in C".
  • b and c are description information of two result events caused by root cause event a.
  • the network fault display device may also include the number of result events caused by the root cause event in the influence range information of the root cause event.
  • The description information of the fault may be "a occurs in A, causing M abnormalities such as b to occur in B", or "a occurs in A, causing M abnormalities to occur in B, C, etc.".
  • M represents the number of result events caused by the root cause event.
  • the description information of the fault can also be: "hwOspfv2IntraAreaRouteridConflict occurred on serverleaf02_1, resulting in two abnormalities such as bgpBackwardTransition_active occurring on DC1-spine-01".
  • In this example, the number of result events caused by the root cause event is 2.
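  • The two ways of presenting multiple result events (listing each one, or reporting a count M with one representative) can be sketched as follows. This is an illustrative sketch only: `describe_fault`, the `max_listed` threshold, and the second result event in the usage example are hypothetical and do not come from the application.

```python
def describe_fault(root_loc, root_desc, results, max_listed=1):
    """Build a fault description from a root cause and its result events.

    results is a list of (location, description) tuples. Up to max_listed
    result events are spelled out; beyond that, only the count and the
    first result event are shown.
    """
    head = f"{root_desc} occurred on {root_loc}"
    if not results:
        return head
    if len(results) <= max_listed:
        clauses = [f"{desc} to occur on {loc}" for loc, desc in results]
        return head + ", causing " + ", and ".join(clauses)
    loc, desc = results[0]
    return (f"{head}, resulting in {len(results)} abnormalities "
            f"such as {desc} occurring on {loc}")


if __name__ == "__main__":
    # Listing form: "a occurred on A, causing b to occur on B, and c to occur on C"
    print(describe_fault("A", "a", [("B", "b"), ("C", "c")], max_listed=2))
    # Count form; the second result event here is a made-up placeholder.
    print(describe_fault("serverleaf02_1", "hwOspfv2IntraAreaRouteridConflict",
                         [("DC1-spine-01", "bgpBackwardTransition_active"),
                          ("example-device", "exampleAlarm")]))
```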
  • The network fault display device can also select one result event from the multiple result events as the first event, and use the location information and description information of the first event as the influence range information of the root cause event in the description information of the fault.
  • The network fault display device can select the result event with the higher impact. For example, B and b in the fault description information "a occurs at A, which causes b to occur at B" may be replaced with the location information and the description information of the first event.
  • the network fault display device may select the result event closest to the application layer among the multiple result events as the first event.
  • the network fault display device may use the gap between the level of the entity to which the result event belongs in the communication model and the application layer as a criterion for judging the degree to which the result event is close to the application layer.
  • the layers of the communication model include application layer, presentation layer, session layer, transport layer, network layer, data link layer and physical layer.
  • the entities to which the resulting events belong may correspond to the levels of the OSI reference model.
  • If the entity to which the result event belongs is an interface of a network device, it may correspond to the physical layer; if the entity is a media access control (MAC) protocol module, it may correspond to the data link layer; if the entity is a routing information protocol (RIP) module, it may correspond to the network layer; if the entity is a Transmission Control Protocol (TCP) module, it may correspond to the transport layer.
  • For example, when result event B is closest to the application layer among the multiple result events, the network fault display device can select result event B as the first event.
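  • The selection rule above (pick the result event whose entity sits closest to the application layer) can be sketched with a simple layer index. The entity-to-layer table follows the examples in the text; the record layout, the `entity_type` key, and the function names are assumptions made for illustration.

```python
# Layers ordered from lowest to highest, as listed in the text.
OSI_LAYERS = ["physical", "data link", "network", "transport",
              "session", "presentation", "application"]

# Entity types and their layers, following the examples in the text.
ENTITY_LAYER = {
    "interface": "physical",
    "MAC": "data link",
    "RIP": "network",
    "TCP": "transport",
}


def gap_to_application_layer(entity_type):
    """Number of layers between the entity's layer and the application layer."""
    layer_index = OSI_LAYERS.index(ENTITY_LAYER[entity_type])
    return (len(OSI_LAYERS) - 1) - layer_index


def select_first_event(result_events):
    """Return the result event with the smallest gap (first one wins on ties)."""
    return min(result_events,
               key=lambda event: gap_to_application_layer(event["entity_type"]))


if __name__ == "__main__":
    events = [
        {"name": "A", "entity_type": "interface"},
        {"name": "B", "entity_type": "TCP"},
        {"name": "C", "entity_type": "RIP"},
    ]
    print(select_first_event(events)["name"])
```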
  • the description information of the fault may only include location information of the root cause event, an identifier of the root cause event, and a meaning of the root cause event.
  • the network fault display device can use verbs to connect the location information of the root cause event and the identification of the root cause event, and use prepositions to connect the identification of the root cause event and the meaning of the root cause event.
  • The meaning of the root cause event may be the semantic description information of the root cause event. For example, if the location information of the root cause event is A, the identification of the root cause event is a, and the meaning of the root cause event is X, then the description information of the fault may be "a occurred at A, that is, X".
  • the fault description can be "hwOspfv2IntraAreaRouteridConflict occurred in domain 0.0.0.0 of serverleaf02_1OSPF process 2, that is, OSPF detected a router ID conflict within the area.”
  • serverleaf02_1OSPF process 2 domain 0.0.0.0 is the location where the root cause event occurred
  • hwOspfv2IntraAreaRouteridConflict is the ID of the root cause event
  • OSPF detected a router ID conflict in the area is the meaning of the root cause event.
  • the network fault display device may also display more detailed information about the resulting event caused by the root cause event.
  • the network fault display device clusters multiple result events, and the description information of the fault may further include information of at least one clustering group.
  • Each of the at least one cluster group includes one or more result events.
  • the network fault display device can use natural language to display information of at least one cluster group.
  • The location information and the description information of the result events included in each of the at least one cluster group are connected by a verb.
  • the network fault display device can classify the multiple result events according to the classification condition, so as to divide the multiple result events into at least one clustering group.
  • The classification condition may include the type of the result event, the meaning of the result event, the entity to which the result event belongs, the topological relationship between the entities to which the result events belong, the level corresponding to the result event, and the like. Each cluster group includes similar result events.
  • The network fault display device can cluster the result events based on the classification conditions to obtain multiple cluster groups. The result events include:
  • Result event 1 "routing protocol error for network device 0"
  • Result event 2 "TCP connection to network device 0 disconnected"
  • Result event 4 "Routing protocol error for network device 1"
  • Result event 8 "TCP connection to network device 2 disconnected”.
  • the network fault display device clusters the result events according to the types of the result events, and can obtain 3 cluster groups (for example, cluster groups A to C).
  • cluster group A may correspond to result events of the "interface disconnection” category, including result event 0, result event 3 and result event 5.
  • cluster group B may correspond to result events of the "protocol error” category, including result event 1, result event 4, and result event 6.
  • cluster group C may correspond to result events of the "TCP connection disconnected" category, including result event 2 and result event 8 .
  • The network fault display device performs clustering according to the meaning of the result event, and result event 6 and result event 7 may be clustered into one cluster group, for example, cluster group D.
  • The network fault display device clusters the result events according to the entity to which each result event belongs, and can obtain three cluster groups (for example, cluster groups E to G).
  • clustering group E includes result events 0-2, and the three result events all belong to network device 0.
  • the clustering group F includes result events 3-4, both of which belong to the network device 1 .
  • the clustering group G includes result events 5-8, and these 4 result events all belong to the network device 2 .
  • the network fault display device clusters the result events according to the topological relationship between the entities to which the result events belong, and can obtain one or more clustering groups.
  • the topological relationship may be the number of hops between entities.
  • The network fault display device can obtain two cluster groups (for example, cluster groups H and I).
  • cluster group H includes result events 0-4, and cluster group I includes result events 5-8.
  • The network fault display device clusters the result events according to the levels corresponding to the result events, and can obtain 3 cluster groups (for example, cluster groups J to L).
  • clustering group J includes result event 0, result event 3 and result event 5, and the level corresponding to the three result events is the physical layer.
  • clustering group K includes result event 1, result event 4, result event 6 and result event 7, and the level corresponding to these four result events is the network layer.
  • Cluster group L includes result event 2 and result event 8, and the level corresponding to these two result events is the transport layer (TCP).
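  • The clustering by classification condition described above amounts to grouping result events by one attribute. The following is a minimal sketch under stated assumptions: the event records are reconstructed from the cluster memberships given in the text (result event 7 is left out because its type is not stated), and the field names and `cluster_by` helper are hypothetical.

```python
from collections import defaultdict

# Illustrative records reconstructed from the groupings in the text.
# Result event 7 is omitted because the text does not state its type.
RESULT_EVENTS = [
    {"id": 0, "type": "interface disconnection", "entity": "network device 0", "layer": "physical"},
    {"id": 1, "type": "protocol error", "entity": "network device 0", "layer": "network"},
    {"id": 2, "type": "TCP connection disconnected", "entity": "network device 0", "layer": "transport"},
    {"id": 3, "type": "interface disconnection", "entity": "network device 1", "layer": "physical"},
    {"id": 4, "type": "protocol error", "entity": "network device 1", "layer": "network"},
    {"id": 5, "type": "interface disconnection", "entity": "network device 2", "layer": "physical"},
    {"id": 6, "type": "protocol error", "entity": "network device 2", "layer": "network"},
    {"id": 8, "type": "TCP connection disconnected", "entity": "network device 2", "layer": "transport"},
]


def cluster_by(events, condition):
    """Group event ids by the value of one classification condition."""
    groups = defaultdict(list)
    for event in events:
        groups[event[condition]].append(event["id"])
    return dict(groups)


if __name__ == "__main__":
    for condition in ("type", "entity", "layer"):
        print(condition, cluster_by(RESULT_EVENTS, condition))
```

  • Clustering by meaning or by topological relationship (for example, hop count between entities) would need a similarity or distance function rather than an exact key, which this sketch does not attempt.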
  • the network fault display device may include information of the above one or more clustering groups in the description information of the fault.
  • The network fault display device can obtain the information of a cluster group by processing, with a natural language generation model, the information of the result events included in that cluster group.
  • The network fault display device can also extract the location information, identification, meaning, etc. of each result event, merge the location information of the result events included in each cluster group, and use verbs to connect the merged location information with the identification and/or meaning of the result events.
  • the fault description information may also include the following information:
  • The following event occurred on 2 entities such as network devices 0 and 2: the TCP connection is disconnected.
  • The following event occurred on 3 entities including network devices 0 to 3: routing protocol error.
  • routing protocol error RIP error.
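  • Merging the location information of a cluster group and connecting it to the shared event description, as in the sentences above, can be sketched as follows. The sentence template is modeled on those examples; `describe_cluster` and its `max_named` parameter are hypothetical names, and a natural language generation model could replace this template entirely.

```python
def describe_cluster(locations, event_description, max_named=2):
    """Merge the locations of one cluster group into a single sentence.

    Duplicate locations are collapsed (order preserved), at most max_named
    locations are spelled out, and a verb phrase connects the merged
    locations to the shared event description.
    """
    unique = list(dict.fromkeys(locations))
    named = " and ".join(unique[:max_named])
    if len(unique) == 1:
        return f"The following event occurred on {named}: {event_description}."
    return (f"The following event occurred on {len(unique)} entities "
            f"such as {named}: {event_description}.")


if __name__ == "__main__":
    print(describe_cluster(["network device 0", "network device 2"],
                           "the TCP connection is disconnected"))
```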
  • The fault description information includes root cause location information, root cause description information, and root cause influence range information (for example, "hwOspfv2IntraAreaRouteridConflict occurred on serverleaf02_1, causing bgpBackwardTransition_active to occur on DC1-spine-01", or "The hwOspfv2IntraAreaRouteridConflict on serverleaf02_1 caused bgpBackwardTransition_active to occur on DC1-spine-01").
  • the description information of the fault may also include information of multiple clustering groups. For example, the fault description information is shown in Figure 4d, which also includes information of five cluster groups:
  • The OSPF neighbor status changes, possibly because the status of the interface where the neighbor resides changes or the content of the received Hello packet changes; an OSPF packet is retransmitted on a non-virtual link interface, possibly due to a physical link failure;
  • the first clustering group includes the result events obtained by clustering with the topological relationship between the entities to which the result events belong as the classification condition.
  • "serverleaf02_1" and "serverleaf03_1" can be two directly connected network devices.
  • the second clustering group and the third clustering group include the result events obtained by clustering with the meaning of the result event as the classification condition.
  • the second clustering group includes the result events related to the bidirectional forwarding detection (Bidirectional Forwarding Detection, BFD) state change
  • the third clustering group includes the result events related to the Open Shortest Path First (Open Shortest Path First, OSPF) protocol .
  • the fourth clustering group includes result events obtained by clustering with the entity to which the result event belongs as the classification condition
  • the fifth cluster group includes the result events obtained by clustering with the type of the result event as the classification condition.
  • The circles, boxes, and arrows in Figures 4a to 4d explicitly mark the location information, description information, and influence range information of the root cause event included in the fault description information, as well as the connecting words between them, to help readers understand this application. It can be understood that these circles, boxes, and arrows are not information that must be included in the fault description information.
  • the network fault display device can present fault description information to the user through its own display interface.
  • the fault description information includes location information of the root cause event, description information, and influence range information of the root cause event. Users do not need to be familiar with the operation manual of the operation and maintenance tool and the meaning of various alarm events, but can intuitively see through the display interface that an abnormal event at a certain device has caused a network failure and other alarm events.
  • the network fault display device also classifies the resulting events, making the presentation of multiple alarm events clearer. This solution reduces the difficulty of fault O&M and improves the efficiency of fault O&M.
  • Fig. 5 is a flow chart of another network fault display method provided by the embodiment of the present application.
  • the network fault display device can display fault description information through its own display screen.
  • the network fault display device may be a notebook computer, and the notebook computer displays fault description information through a built-in display screen.
  • the network fault display device may also display fault description information through other methods.
  • the network fault display device sends the fault description information to the display screen through the connection line, and the display screen directly displays the fault description information.
  • The connection line can be of various types, for example, a High-Definition Multimedia Interface (HDMI) cable, a Video Graphics Array (VGA) cable, a Digital Visual Interface (DVI) cable, or a DisplayPort (DP) cable.
  • the network can be various types of networks, for example, wireless local area network, Ethernet, cellular network and so on.
  • the fault maintenance equipment in the maintenance center displays the fault description information, and at the same time sends the fault description information to the office equipment of the maintenance engineer, such as mobile phones, tablets, and personal computers.
  • The maintenance engineer can view the fault description information at any time according to the needs of the work site.
  • the method includes the following steps S500-S505.
  • Step S500 the network device sends an alarm/log to the network fault display device.
  • the network device can actively send alarms/logs to the network fault display device, and the network fault display device can also actively obtain alarms/logs from the network device.
  • Step S501 the network fault display device acquires information about multiple alarm events in the network.
  • Step S502 the network fault display device determines a network fault according to the information of multiple alarm events.
  • step S501 and step S502 reference may be made to the description of step S201 and step S202 in the foregoing method embodiments, and details are not repeated here.
  • Step S503 the network fault display device sends description information of the fault.
  • the network fault display device may send the description information of the fault to the terminal device.
  • Step S504 the terminal device receives the fault description information.
  • Step S505 the terminal device displays the fault description information in text form.
  • the fault description information received by the terminal device may be fault description information expressed in natural language, for example, the fault description information shown in FIG. 4a or FIG. 4d.
  • the terminal device can directly display the fault description information.
  • the fault description information received by the terminal device may also only include key content of the fault description information, for example, the location information of the root cause event, the identification of the root cause event, the influence range information of the root cause event, and the like.
  • the terminal device can assemble the key content, and then display the connected fault description information in text form. For example, the terminal device uses verbs or prepositions to connect the location information of the root cause event and the identification of the root cause event, and uses verbs or prepositions to connect the identification of the root cause event and the influence range information of the root cause event.
  • When the terminal device receives the description information of the fault, it can display the description information in text form through its display interface, so as to present the description information of the fault to the user.
  • The process by which the terminal device displays the description information in text form is the same as the process by which the network fault display device does so. For details, refer to the description of step S203 in the method embodiment shown in FIG. 2 above, which is not repeated here.
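  • When only the key content is sent in step S503, the terminal device has to assemble the sentence itself in step S505. The sketch below assumes, purely for illustration, that the key content travels as a JSON object; the field names (`root_location`, `root_id`, `influence_range`) and the connecting words are hypothetical, since the application does not specify a message format.

```python
import json


def assemble_description(payload_json):
    """Assemble displayable text from key content received by the terminal.

    A verb connects the root cause identification and its location, and a
    second verb connects the result to the influence range, if present.
    """
    fields = json.loads(payload_json)
    text = f"{fields['root_id']} occurred on {fields['root_location']}"
    influence_range = fields.get("influence_range")
    if influence_range:
        text += f", causing {influence_range}"
    return text


if __name__ == "__main__":
    payload = json.dumps({
        "root_location": "serverleaf02_1",
        "root_id": "hwOspfv2IntraAreaRouteridConflict",
        "influence_range": "bgpBackwardTransition_active to occur on DC1-spine-01",
    })
    print(assemble_description(payload))
```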
  • The embodiment of the present application also provides a network device, which is used to implement step S200 shown in FIG. 2 or step S500 described above, that is, to send alarms and/or logs to the network fault display device.
  • The network device can be configured to send alarms and/or logs to the network fault display device according to a preset period, or to send alarms and/or logs to the network fault display device when it receives a collection instruction from the network fault display device.
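  • The two reporting modes (periodic push and reply to a collection instruction) can be sketched with a small buffer on the network device. This is a simplified model, not the device's actual behavior: the `AlarmReporter` class, its method names, and the use of an explicit `now` timestamp instead of a real timer are all assumptions.

```python
class AlarmReporter:
    """Buffers alarms on a network device and releases them in batches."""

    def __init__(self, period_s):
        self.period_s = period_s  # preset reporting period, in seconds
        self.pending = []         # alarms/logs not yet sent
        self.last_sent = 0.0      # time of the most recent send

    def record(self, alarm):
        self.pending.append(alarm)

    def _flush(self, now):
        batch, self.pending = self.pending, []
        self.last_sent = now
        return batch

    def tick(self, now):
        """Periodic mode: return the batch to send if the period has elapsed."""
        if now - self.last_sent >= self.period_s:
            return self._flush(now)
        return []

    def on_collect_instruction(self, now):
        """Pull mode: return everything pending when the display device asks."""
        return self._flush(now)


if __name__ == "__main__":
    reporter = AlarmReporter(period_s=60)
    reporter.record("interface down")
    print(reporter.tick(now=30))   # period not yet elapsed
    print(reporter.tick(now=60))   # period elapsed, batch is released
```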
  • the embodiment of the present application further provides a terminal device, which implements step S504 and step S505 shown in FIG. 5 .
  • the terminal device is configured to receive the fault description information sent by the network fault display device, and display the fault description information through its own display interface.
  • FIG. 6 is a schematic structural diagram of a network fault display device 600 provided by an embodiment of the present application.
  • the network fault display device 600 is used to implement step S201-step S203 in FIG. 2 .
  • the network fault display device 600 includes: an acquisition module 601 , a determination module 602 and a display module 603 .
  • the obtaining module 601 is used for obtaining information of multiple alarm events in the network.
  • the determining module 602 is configured to determine a fault of the network according to information of multiple alarm events.
  • the display module 603 is configured to display the description information of the fault in text form through a display interface.
  • When the network fault display device 600 provided by the embodiment shown in FIG. 6 executes the network fault display method, the division into the above functional modules is used only as an example. In practical applications, the above functions can be assigned to different functional modules as needed; that is, the internal structure of the device is divided into different functional modules to complete all or part of the functions described above.
  • the network fault display device provided by the above embodiment is based on the same idea as the embodiment of the network fault display method shown in FIG. 2 , and its specific implementation process is detailed in the method embodiment, and will not be repeated here.
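  • The module split of device 600 (obtain, determine, display) can be pictured as three methods on one object. This is only a structural sketch: the class name, the injected `alarm_source` and `analyzer` callables, and the `out` callback are hypothetical, and the real determination logic is whatever the method embodiment describes.

```python
class NetworkFaultDisplayDevice:
    """Structural sketch of the obtaining, determining, and display modules."""

    def __init__(self, alarm_source, analyzer, out=print):
        self.alarm_source = alarm_source  # callable returning alarm events
        self.analyzer = analyzer          # callable: alarms -> description text
        self.out = out                    # callable that presents the text

    def obtain(self):
        """Obtaining module: collect information of multiple alarm events."""
        return list(self.alarm_source())

    def determine(self, alarms):
        """Determining module: derive the fault description from the alarms."""
        return self.analyzer(alarms)

    def display(self, description):
        """Display module: present the description in text form."""
        self.out(description)

    def run(self):
        description = self.determine(self.obtain())
        self.display(description)
        return description


if __name__ == "__main__":
    device = NetworkFaultDisplayDevice(
        alarm_source=lambda: ["alarm-1", "alarm-2"],
        analyzer=lambda alarms: f"{len(alarms)} alarm events analyzed",
    )
    device.run()
```

  • Passing a function that transmits the text to a terminal as `out` would give the obtain, determine, and send arrangement of the device that forwards the description instead of displaying it.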
  • FIG. 7 is a schematic structural diagram of another network fault display device 700 provided by an embodiment of the present application.
  • the network fault display device 700 is used to implement the steps S501-S503 in FIG. 5 above. As shown in FIG. 7 , the network fault display device 700 includes: an acquiring module 701 , a determining module 702 and a sending module 703 .
  • the obtaining module 701 is used for obtaining information of multiple alarm events in the network.
  • the determining module 702 is configured to obtain description information of network faults according to information of multiple alarm events.
  • the sending module 703 is used for sending description information of the fault.
  • Other devices such as terminal devices, receive the description information of the fault, and display the description information in text form through their display interface.
  • When the network fault display device 700 provided by the embodiment shown in FIG. 7 executes the network fault display method, the division into the above functional modules is used only as an example. In practical applications, the above functions can be assigned to different functional modules as needed; that is, the internal structure of the device is divided into different functional modules to complete all or part of the functions described above.
  • the network fault display device provided by the above embodiment is based on the same idea as the embodiment of the network fault display method shown in FIG. 5 , and its specific implementation process is detailed in the method embodiment, and will not be repeated here.
  • FIG. 8 is a schematic diagram of a hardware structure of a network device 800 according to an embodiment of the present application.
  • the network device 800 may be the above-mentioned network fault display device or the above-mentioned terminal device.
  • the network device 800 includes a processor 810 , a memory 820 , a communication interface 830 and a bus 840 , and the processor 810 , the memory 820 and the communication interface 830 are connected to each other through the bus 840 .
  • Processor 810, memory 820, and communication interface 830 may also be connected in other connection manners than bus 840.
  • The memory 820 can be any of various types of storage media, such as random access memory (RAM), read-only memory (ROM), non-volatile RAM (NVRAM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), flash memory, optical memory, or a hard disk.
  • the processor 810 may be a general-purpose processor, and the general-purpose processor may be a processor that performs specific steps and/or operations by reading and executing contents stored in a memory (such as the memory 820 ).
  • the general processor may be a central processing unit (CPU).
  • the processor 810 may include at least one circuit to execute all or part of the steps of the network fault display method provided in the embodiment shown in FIG. 2 or FIG. 5 .
  • The communication interface 830 includes an input/output (I/O) interface, a physical interface, a logical interface, and the like, which are used to interconnect components inside the network device 800 and to interconnect the network device 800 with other devices (such as other network devices or user equipment).
  • the physical interface can be Ethernet interface, optical fiber interface, ATM interface, etc.
  • bus 840 may be any type of communication bus for interconnecting the processor 810, the memory 820 and the communication interface 830, such as a system bus.
  • the above-mentioned devices may be respectively arranged on independent chips, or at least partly or all of them may be arranged on the same chip. Whether each device is independently arranged on different chips or integrated and arranged on one or more chips often depends on the needs of product design.
  • the embodiments of the present application do not limit the specific implementation forms of the foregoing devices.
  • the network device 800 shown in FIG. 8 is only exemplary. During implementation, the network device 800 may also include other components, which will not be listed here.
  • In the foregoing embodiments, all or part of the implementation may be realized by software, hardware, firmware, or any combination thereof.
  • When implemented using software, it may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on the computer, the processes or functions according to the embodiments of the present invention will be generated in whole or in part.
  • the computer can be a general purpose computer, a special purpose computer, a computer network, or other programmable devices.
  • The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center by wired (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (for example, infrared, radio, or microwave) means.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center integrated with one or more available media.
  • the available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, DVD), or a semiconductor medium (for example, a solid state disk (solid state disk, SSD)), etc.

Abstract

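A network fault display method and device are disclosed, belonging to the field of network technologies. The method includes: obtaining information about multiple alarm events in a network, determining a fault of the network according to the information about the multiple alarm events, and displaying description information of the fault in text form. The description information of the fault includes location information of a root cause event, description information of the root cause event, and influence range information of the root cause event. The method displays, in text form, the information about the root cause event that caused the network fault and the influence range of the root cause event, so that a user can intuitively understand the fault that occurred in the network.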

Description

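Network fault display method and device
This application claims priority to Chinese Patent Application No. 202111016800.9, filed with the China National Intellectual Property Administration on August 31, 2021 and entitled "Network fault display method and device", which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of network technologies, and in particular, to a network fault display method and device.
Background
The causes of network faults are complex. For example, in a data center network (DCN), an address resolution protocol (ARP) entry count exceeding its limit, a device restart, or a router identity (RI) conflict can each cause a network fault.
The front-end interface of a network fault operation and maintenance (O&M) system often presents only fault-related information such as alarms and topology paths. A user must be familiar with the O&M tool's operation manual and the meaning of the alarms to know which specific fault occurred in the network and why.
Summary
This application provides a network fault display method and device that display, in text form, information about the root cause event that caused a fault and influence range information of the root cause event, so that a user can intuitively see that an abnormal event at a certain device in the network caused the network fault and other alarm events.
According to a first aspect, a network fault display method is provided. The method includes: obtaining information about multiple alarm events in a network, determining a fault of the network according to the information about the multiple alarm events, and displaying description information of the fault in text form.
The description information of the fault includes location information of a root cause event, description information of the root cause event, and influence range information of the root cause event. The root cause event is one of the multiple alarm events. The other alarm events among the multiple alarm events are result events of the root cause event. The influence range information of the root cause event includes information about at least one of the result events.
A single fault may cause multiple devices to generate multiple alarm events. This solution distinguishes the root cause event from the result events among the multiple alarm events, and displays in text form the location information and description information of the root cause event that caused the fault, together with the influence range of the root cause event. Therefore, without being familiar with the O&M tool's operation manual or the meaning of the various alarm events, a user can intuitively see that an abnormal event at a certain device caused the network fault and the other alarm events. This solution reduces the difficulty of fault O&M and improves its efficiency.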
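According to the first aspect, in a first possible implementation of the first aspect, the information about the at least one result event includes location information of the at least one result event and description information of the at least one result event.
According to the first aspect or the first possible implementation of the first aspect, in a second possible implementation of the first aspect, displaying the description information of the fault in text form includes displaying the description information of the fault in natural language. The location information of the root cause event and the description information of the root cause event are connected by a verb, and the description information of the root cause event and the influence range information of the root cause event are connected by a verb. Alternatively, the description information of the root cause event and the location information of the root cause event are connected by a preposition, and the description information of the root cause event and the influence range information of the root cause event are connected by a verb.
For example, if A, a, and X respectively denote the location information of the root cause event, the description information of the root cause event, and the influence range information of the root cause event, the description information of the fault may be "a occurred at A, causing X", or "a at A caused X". The location information of the root cause event indicates where the root cause event occurred, and may include attribute information of the entity to which the root cause event belongs. The attribute information may be one or more of the name, type, identity (ID), and Internet Protocol (IP) address of that entity.
Displaying the description information of the fault in natural language that matches the user's habits of expression makes it easier for the user to understand the network fault.
According to the first aspect or any one of the foregoing possible implementations of the first aspect, in a third possible implementation of the first aspect, the description information of the root cause event includes an identifier of the root cause event and/or a meaning of the root cause event. The identifier of the root cause event may be the name or the ID of the root cause event. The meaning of the root cause event may be semantic description information of the root cause event.
According to the first aspect or any one of the foregoing possible implementations of the first aspect, in a fourth possible implementation of the first aspect, the description information of the fault further includes information about at least one cluster group. Each of the at least one cluster group includes one or more of the result events.
Clustering the result events so that they are displayed by category lets the user see more clearly which types of alarm events the root cause event caused.
According to the fourth possible implementation of the first aspect, in a fifth possible implementation of the first aspect, the information about the at least one cluster group is displayed in natural language. The location information of the result events included in each of the at least one cluster group and the description information of the result events included in each of the at least one cluster group are connected by a verb.
According to the fourth or fifth possible implementation of the first aspect, in a sixth possible implementation of the first aspect, the result events are clustered according to a classification condition to obtain the at least one cluster group. The classification condition includes one or more of the following: the type of a result event, the meaning of a result event, the entity to which a result event belongs, the topological relationship between the entities to which result events belong, and the level corresponding to a result event.
According to the first aspect or any one of the foregoing possible implementations of the first aspect, in a seventh possible implementation of the first aspect, determining the fault of the network according to the information about the multiple alarm events includes: determining the root cause event according to the information about the multiple alarm events and a fault location evidence chain. The fault location evidence chain indicates the causal relationships among the multiple alarm events.
According to the first aspect or any one of the foregoing possible implementations of the first aspect, in an eighth possible implementation of the first aspect, the description information of the fault is sent.
Another device, for example, user equipment, can receive the description information of the fault and display it in text form.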
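The seventh possible implementation above determines the root cause event from a fault location evidence chain that records causal relationships among alarm events. One simple way to picture this, offered only as a hedged sketch, is to treat the chain as (cause, effect) pairs and pick the alarm that is never an effect. The pair representation and the `find_root_cause` helper are assumptions; the application does not specify how the evidence chain is stored or evaluated.

```python
def find_root_cause(alarm_ids, evidence_chain):
    """Split alarms into one root cause event and its result events.

    evidence_chain is an iterable of (cause, effect) pairs. The root cause
    is the only alarm that never appears as an effect of another alarm.
    """
    alarms = set(alarm_ids)
    effects = {effect for cause, effect in evidence_chain
               if cause in alarms and effect in alarms}
    roots = [alarm for alarm in alarm_ids if alarm not in effects]
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root cause, found {len(roots)}")
    root = roots[0]
    results = [alarm for alarm in alarm_ids if alarm != root]
    return root, results


if __name__ == "__main__":
    alarms = ["bgpBackwardTransition_active", "hwOspfv2IntraAreaRouteridConflict"]
    chain = [("hwOspfv2IntraAreaRouteridConflict", "bgpBackwardTransition_active")]
    print(find_root_cause(alarms, chain))
```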
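According to a second aspect, a network fault display device is provided. The network fault display device includes an obtaining module, a determining module, and a display module.
The obtaining module is configured to obtain information about multiple alarm events in a network.
The determining module is configured to determine a fault of the network according to the information about the multiple alarm events.
The display module is configured to display description information of the fault in text form.
The description information of the fault includes location information of a root cause event, description information of the root cause event, and influence range information of the root cause event. The root cause event is one of the multiple alarm events. The other alarm events among the multiple alarm events are result events of the root cause event. The influence range information of the root cause event includes information about at least one of the result events.
According to the second aspect, in a first possible implementation of the second aspect, the information about the at least one result event includes location information of the at least one result event and description information of the at least one result event.
According to the second aspect or the first possible implementation of the second aspect, in a second possible implementation of the second aspect, the display module is configured to display the description information of the fault in natural language.
The location information of the root cause event and the description information of the root cause event are connected by a verb, and the description information of the root cause event and the influence range information of the root cause event are connected by a verb. Alternatively, the description information of the root cause event and the location information of the root cause event are connected by a preposition, and the description information of the root cause event and the influence range information of the root cause event are connected by a verb.
According to the second aspect or any one of the foregoing possible implementations of the second aspect, in a third possible implementation of the second aspect, the description information of the root cause event includes an identifier of the root cause event and/or a meaning of the root cause event.
According to the second aspect or any one of the foregoing possible implementations of the second aspect, in a fourth possible implementation of the second aspect, the description information of the fault further includes information about at least one cluster group. Each of the at least one cluster group includes one or more of the result events.
According to the fourth possible implementation of the second aspect, in a fifth possible implementation of the second aspect, the display module is further configured to display the information about the at least one cluster group in natural language. The location information of the result events included in each of the at least one cluster group and the description information of the result events included in each of the at least one cluster group are connected by a verb.
According to the fourth or fifth possible implementation of the second aspect, in a sixth possible implementation of the second aspect, the display module is configured to cluster the result events based on a classification condition to obtain the at least one cluster group. The classification condition includes one or more of the following: the type of a result event, the meaning of a result event, the entity to which a result event belongs, the topological relationship between the entities to which result events belong, and the level corresponding to a result event.
According to the second aspect or any one of the foregoing possible implementations of the second aspect, in a seventh possible implementation of the second aspect, the determining module determining the fault of the network according to the information about the multiple alarm events includes: the determining module determining the root cause event according to the information about the multiple alarm events and a fault location evidence chain. The fault location evidence chain indicates the causal relationships among the multiple alarm events.
According to the second aspect or any one of the foregoing possible implementations of the second aspect, in an eighth possible implementation of the second aspect, the network fault display device further includes a sending module. The sending module is configured to send the description information of the fault.
According to a third aspect, a network fault display device is provided. The network fault display device includes a processor and a memory. The processor is configured to execute a computer program stored in the memory to implement the network fault display method provided in the first aspect or any possible implementation of the first aspect.
According to a fourth aspect, a computer-readable storage medium is provided. The computer-readable storage medium stores instructions that, when run on a computer, cause the computer to execute the network fault display method provided in the first aspect or any possible implementation of the first aspect.
According to a fifth aspect, a computer program product containing instructions is provided. When the computer program product runs on a computer, the computer executes the network fault display method provided in the first aspect or any possible implementation of the first aspect.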
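Brief Description of Drawings
FIG. 1 is a schematic diagram of a data center network;
FIG. 2 is a flowchart of a network fault display method provided by an embodiment of this application;
FIG. 3a is a schematic diagram of the relationships in a log of a network device on which an RI conflict occurs, provided by an embodiment of this application;
FIG. 3b is a schematic diagram of the entity information in a log of a network device on which an RI conflict occurs, provided by an embodiment of this application;
FIG. 3c is a schematic diagram of the alarm information in a log of a network device on which an RI conflict occurs, provided by an embodiment of this application;
FIG. 4a is a schematic diagram of description information of a fault provided by an embodiment of this application;
FIG. 4b is a schematic diagram of other description information of a fault provided by an embodiment of this application;
FIG. 4c is a schematic diagram of other description information of a fault provided by an embodiment of this application;
FIG. 4d is a schematic diagram of other description information of a fault provided by an embodiment of this application;
FIG. 5 is a flowchart of another network fault display method provided by an embodiment of this application;
FIG. 6 is a schematic structural diagram of a network fault display device provided by an embodiment of this application;
FIG. 7 is a schematic structural diagram of another network fault display device provided by an embodiment of this application;
FIG. 8 is a schematic structural diagram of another network fault display device provided by an embodiment of this application.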
具体实施方式
为了使本申请实施例的目的、技术方案和优点更加清楚,下面将结合附图,对本申请实施例中的技术方案进行描述。
在本申请实施例的描述中,“示例性的”或者“例如”等词用于表示作例子或说明。本申请实施例中被描述为“示例性的”或者“例如”的任何实施例或设计方案不应被解释为比其它实施例或设计方案更优选或更具优势。确切而言,使用“示例性的”或者“例如”等词旨在以具体方式呈现相关概念。
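In the description of the embodiments of this application, the term “and/or” merely describes an association between associated objects and indicates that three relationships are possible. For example, A and/or B may mean: A alone, B alone, or both A and B. In addition, unless otherwise stated, the term “multiple” means two or more. For example, multiple systems means two or more systems, and multiple screen terminals means two or more screen terminals.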
此外,术语“第一”、“第二”仅用于描述目的,而不能理解为指示或暗示相对重要性或者隐含指明所指示的技术特征。由此,限定有“第一”、“第二”的特征可以明示或者隐含地包括一个或者更多个该特征。术语“包括”、“包含”、“具有”及它们的变形都意味着“包括但不限于”,除非是以其他方式另外特别强调。
图1是一种数据中心网络的拓扑结构示意图。如图1所示,该数据中心网络采用叶脊网络架构,具体包括:至少一个脊(spine)交换机、至少一个叶(leaf)交换机和至少一个服务器(server)。例如,该至少一个脊交换机包括脊交换机101和脊交换机102,该至少一个叶交换机包括叶交换机111~114,该至少一个服务器包括服务器121~124。其中,脊交换机可以是汇聚交换机,叶交换机可以是接入交换机,脊交换机和叶交换机之间是全连接关系。叶交换机连接至少一个服务器。其中,服务器可以提供多种类型的服务,例如,网页服务、视频服务和存储服务。
图1所示的数据中心网络仅是本申请实施例的一个应用场景图。本申请实施例还可以应用于其他网络,例如,企业园区网、校园网、运营商网络。其中,运营商网络可以是运营商接入网、运营商传输网、或者包括接入网和传输网的运营商的端到端的网络。
当网络发生故障时,该网络中的多个设备可能会产生多个告警。例如,若图1中所述的叶交换机111连接脊交换机101的接口断开了,则叶交换机111会产生接口断开、协议异常等告警事件,脊交换机101也会产生接口断开、协议异常等告警事件,服务器121会产生业务断开、传输控制协议断连等告警事件。如果故障运维界面仅显示该多个告警事件,则运维人员需要逐个点击每个告警事件并学习每个告警事件的具体含义,以推理告警事件之间的关联性,从而获取该故障产生的具体原因。这增加了故障运维的难度,降低了处理网络故障的效率。
鉴于此,本申请实施例提供一种网络故障显示方法。在该方法中,网络故障显示设备获取网络中的多个告警事件的信息,根据该多个告警事件的信息确定该网络的故障,并以文字的形式显示该故障的描述信息。其中,该故障的描述信息包括根因事件的位置信息、该根因事件的描述信息和该根因事件的影响范围信息。该根因事件是多个告警事件中的一个告警事件。多个告警事件中的其他告警事件为该根因事件的结果事件。该根因事件的影响范围信息包括结果事件中的至少一个结果事件的信息。即,用户通过查看该故障的描述信息即可清楚的了解到:发生在哪个位置的哪个异常事件导致网络发生了故障,并引发了其他告警事件的产生。本申请实施例的详细方案请参考下述描述。
图2是本申请实施例提供的一种网络故障显示方法的流程图。该方法应用于网络故障显示设备。如图2所示,该方法包括如下的步骤S201-步骤S203。
步骤S201、获取网络中的多个告警事件的信息。
告警事件可以是多种类型的事件。例如,告警事件可以是性能异常事件(例如,业务的时延超过阈值,或者网络的丢包率超过阈值)。又例如,告警事件可以是设备的状态异常事件(例如,设备的接口断开,或者设备上运行的协议状态错误)。又例如,告警事件可以是设备日志中记录的异常事件。
本实施例中,网络故障显示设备可以从各个网络设备获得多个告警事件的信息。在一个示例中,网络故障显示设备可以发送采集指令以从各个网络设备获取告警事件的信息或存储告警事件信息的日志。在一个示例中,网络设备也可以按照设定的周期定时地向网络故障显示设备发送告警事件的信息或存储告警事件信息的日志,例如图2所示的步骤S200。在一个示例中,网络设备也可以在检测到告警事件时向网络故障显示设备发送告警事件的信息。网络设备可以是多种类型的设备,例如,图1所示数据中心网络中的服务器和交换机。
在一个示例中,告警事件的信息可以包括:实体与实体的连接关系、告警事件与其所属实体的关联关系、告警事件所属实体的属性信息和告警事件的属性信息中的一种或多种。其中,实体可以是网络设备、网络设备的接口或者网络设备中的协议模块等。下面将结合附图3a-图3c对告警事件的各个信息进行举例说明。图3a-图3c是网络设备发生RI冲突的日志示意图。其中,图3a示出了实体与实体的连接关系以及告警事件与实体的关联关系,图3b示出了实体的信息,图3c示出了告警事件的属性信息。
具体地,实体与实体的连接关系表征两个实体直接连接。例如,若实体1和实体2均是网络设备,则实体1与实体2的连接关系表示该实体1和实体2直接连接。又例如,若实体3是接口,则实体1和实体3的连接关系表示该实体3是实体1的接口。实体与实体之间的连接关系可以通过关联两个实体的标识表示。实体的标识可以是实体的ID。如图3a所示,“edges”类中的第一组“srcResId”和“tarResId”指示实体与实体的连接关系。其中,“srcResId”指示源实体的ID,具体为“83b97583-b7a9-4e40-a149-de1c4ec9674f”,“tarResId”指示目标实体的ID,具体为“f31d3ce9-4fa0-3276-914f-b19821e00b7a”。
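Fig. 3b shows that the entity whose ID is f31d3ce9-4fa0-3276-914f-b19821e00b7a is an interface, and the entity whose ID is 83b97583-b7a9-4e40-a149-de1c4ec9674f is a network device. That is, the first group of information indicates that the interface whose ID is f31d3ce9-4fa0-3276-914f-b19821e00b7a is an interface of the network device whose ID is 83b97583-b7a9-4e40-a149-de1c4ec9674f.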
具体地,告警事件与告警事件所属实体的关联关系表征该实体发生了该告警事件。类似的,可以通过关联告警事件的标识和告警事件所属实体的标识表示告警事件与所属实体之间的关联关系。告警事件的标识可以是告警事件的ID。继续参阅图3a,“edges”类中的第二组“srcResId”和“tarResId”指示告警事件与告警事件所属实体之间的关联关系。其中,“srcResId”指示告警事件的ID,具体为“346705413275670”,“tarResId”指示告警事件所属的实体的ID,具体为“f31d3ce9-4fa0-3276-914f-b19821e00b7a”。即,该第二组信息表示ID为f31d3ce9-4fa0-3276-914f-b19821e00b7a的实体发生了ID为346705413275670的告警事件。
具体地,实体的属性信息可以包括:实体的类型、名称、IP地址和ID等一种或多种。如图3b所示,ID为f31d3ce9-4fa0-3276-914f-b19821e00b7a的实体的类型为“Interface”、名称为“10GE3/0/10”、IP地址为“192.168.7.2”。根据该实体的类型和名称可以确定该实体为一个网络设备的接口。结合图3a可知,该接口发生了一个ID为346705413275670的告警事件。继续参考图3b,ID为 83b97583-b7a9-4e40-a149-de1c4ec9674f的实体的类型为“NetworkElement”、名称为“DC1-spine-01”、IP地址为“192.84.21.109”。结合图3a可知,该网络设备包含一个ID为f31d3ce9-4fa0-3276-914f-b19821e00b7a的接口。
具体地,告警事件的属性信息可以包括:告警事件的名称、语义描述信息、ID、等级和位置信息的一种或多种。其中,位置信息指示告警事件发生的位置,可以包括告警事件所属实体的属性信息。告警事件所属实体的属性信息可以是告警事件所属实体的名称、类型、ID、IP地址。如图3c所示,告警事件的名称为“conflictrouteridintf”,告警事件的等级为“2”,告警事件的ID为“346705413275670”,告警事件的语义描述信息为“OSPF router ID conflict is detected on the interface”。继续参阅图3c,告警事件所属实体的类型为“Interface”,告警事件所属实体的名称为“DC1-spine-01”,告警事件所属实体的ID为“f31d3ce9-4fa0-3276-914f-b19821e00b7a”,告警事件所属实体的IP地址“192.168.7.2”。
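The relationships described for Figs. 3a to 3c can be sketched as a small lookup. This is a minimal illustration under assumptions, not the format of the actual log: the keys `srcResId` and `tarResId` and all IDs, names, and addresses are quoted from the text, while the dictionary layout, the field names `type`, `name`, `ip`, `level`, `description`, and the helper `locate_alarm` are hypothetical.

```python
# Hypothetical in-memory form of the data shown in Figs. 3a-3c.
# Only the IDs, names, addresses and the srcResId/tarResId keys come from the text.
DEVICE_ID = "83b97583-b7a9-4e40-a149-de1c4ec9674f"
INTERFACE_ID = "f31d3ce9-4fa0-3276-914f-b19821e00b7a"
ALARM_ID = "346705413275670"

entities = {
    DEVICE_ID: {"type": "NetworkElement", "name": "DC1-spine-01", "ip": "192.84.21.109"},
    INTERFACE_ID: {"type": "Interface", "name": "10GE3/0/10", "ip": "192.168.7.2"},
}
alarms = {
    ALARM_ID: {
        "name": "conflictrouteridintf",
        "level": "2",
        "description": "OSPF router ID conflict is detected on the interface",
    },
}
edges = [
    {"srcResId": DEVICE_ID, "tarResId": INTERFACE_ID},  # entity-to-entity connection
    {"srcResId": ALARM_ID, "tarResId": INTERFACE_ID},   # alarm-to-entity association
]


def locate_alarm(alarm_id):
    """Return (parent entity name, owning entity name) for an alarm."""
    # The alarm-to-entity edge gives the entity on which the alarm occurred.
    owner_id = next(e["tarResId"] for e in edges if e["srcResId"] == alarm_id)
    # An entity-to-entity edge pointing at the owner gives its parent, if any.
    parent_id = next(
        (e["srcResId"] for e in edges
         if e["tarResId"] == owner_id and e["srcResId"] in entities),
        None,
    )
    parent = entities.get(parent_id)
    return (parent["name"] if parent else None, entities[owner_id]["name"])
```

Calling `locate_alarm(ALARM_ID)` walks the two edges and places the alarm on interface 10GE3/0/10 of DC1-spine-01, which is the location reading the text derives from Figs. 3a and 3b.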
步骤S202、根据多个告警事件的信息确定网络的故障。
本实施例中,网络故障显示设备可以基于多个告警事件的信息,对多个告警事件进行分析归纳,以确定网络的故障。在一个示例中,网络故障显示设备可以使用故障定位证据链,确定多个告警事件中的根因事件,以及根因事件引起的结果事件。故障定位证据链指示多个告警事件之间的因果关系,可以包括一个或多个结果事件的信息以及导致该一个或多个结果事件发生的根因事件的信息。网络故障显示设备可以比较获取到的告警事件的信息和故障定位证据链中各个告警事件的信息,以确定多个获取到的告警事件中的根因事件和结果事件。
例如,故障定位证据链可以为“事件L1的信息->事件L2的信息->事件L3的信息->事件L4的信息”。其中,事件L1为故障定位证据链中的根因事件,事件L2-L4为事件L1引起的结果事件。事件L2-L4中事件L2也是事件L3的因事件,事件L3是事件L4的因事件。事件L1-L4的信息可以包括前述步骤S201中列出的告警事件的信息。网络故障显示设备可以比较告警事件的信息与故障定位证据链事件L1-事件L4的信息。当该告警事件的信息与事件L1的信息一致时,该告警事件即为根因事件。当该告警事件的信息与事件L2(或L3,或L4)的信息一致时,该告警事件即为结果事件。
故障定位证据链还可以只包括根因事件的一个信息和该根因事件引起的结果事件的一个信息。例如,故障定位证据链可以包括根因事件的名称和该根因事件引起的结果事件的名称。以上述RI冲突故障为例,用事件名称表示的故障定位证据链具体可以是如下四条:
OSPFRouterIDConflict->OSPFPeerDisConnect->OSPFPeerDisConnect->OSPFRouterIDConflict;
OSPFRouterIDConflict->VxlanTunnelDown;
OSPFRouterIDConflict->OSPFPeerDisConnect->OSPFPeerDisConnect->conflictrouteridintf->hwofpsessiondownactive;
OSPFRouterIDConflict->OSPFPeerDisConnect->hwofpsessiondownactive->BGPPeerDisConnect->BGPPeerDisConnect。
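In the first evidence chain above, the first event name, OSPFRouterIDConflict, is the name of the root cause event, and the three names after it are the names of result events caused by that root cause event. The result event denoted by the second event name, OSPFPeerDisConnect, is also the cause of the result event denoted by the third event name, OSPFPeerDisConnect, and so on; the multiple result events can be understood as a chain reaction triggered by the root cause event. The second to fourth evidence chains are interpreted in the same way and are not described again here.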
例如,故障定位证据链还可以只包括根因事件所属实体的名称和该根因事件引起的结果事件所属实体的名称。与上述包含事件名称的故障定位证据链相对应的,用实体名称表示的故障定位证据链可以包括如下四条:
OSPFArea->OSPFNetwork->OSPFNetwork->OSPFArea;
OSPFArea->VXLANTunnel;
OSPFArea->OSPFNetwork->OSPFNetwork->EnterpriseInterfacePhysical;
OSPFArea->OSPFNetwork->BGPPeer->BGPPeer。
可以理解的,以第一个证据链为例,第一个实体名称表示的是根因事件所属实体的名称,第一个实体名称之后的实体名称表示结果事件所属实体的名称。
应理解,用事件的单一信息表示的故障定位证据链不限于本申请上述示出的两种,还可以用告警事件的其他信息来表示。此外,在用事件的单一信息表示的故障定位证据链中,两个告警事件的该类别的信息可能相同,但不表示两个事件是同一个事件,它们的其他信息会存在不同。因此,在利用单一信息的故障定位证据链对事件的信息进行分析时,可以结合故障定位证据链对应的多个维度的信息确定告警事件是根因事件还是结果事件。例如,当告警事件的名称与故障定位证据链中两个事件的名称相同时,可以再比较该告警事件与两个事件的实体的名称是否相同,从而确定告警事件与两个事件中的哪个事件的信息一致。再例如,当该告警事件的实体的名称与两个事件的实体的名称也相同时,还可以再从告警事件的实体的IP地址角度比对。
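The comparison against a fault locating evidence chain can be sketched as follows. This is only an illustration under assumptions: the event and entity names are taken from the chains listed above, but the representation of a chain as a list of dictionaries, the field names `name` and `entity`, and the functions `match_positions` and `classify` are hypothetical. An alarm that matches both the first position and a later position is reported as ambiguous, which is the case where the text suggests comparing additional dimensions such as the entity name or IP address.

```python
def match_positions(alarm, chain):
    """Indices of chain entries whose every specified field equals the alarm's."""
    return [
        i for i, entry in enumerate(chain)
        if all(alarm.get(key) == value for key, value in entry.items())
    ]


def classify(alarm, chain):
    """Classify an alarm as 'root', 'result', 'ambiguous', or None (no match)."""
    positions = match_positions(alarm, chain)
    if not positions:
        return None
    if positions == [0]:
        return "root"        # matches only the first entry of the chain
    if 0 not in positions:
        return "result"      # matches only entries after the first
    return "ambiguous"       # more dimensions are needed to decide


# Second chain from the text, combining event names and entity names.
chain_two = [
    {"name": "OSPFRouterIDConflict", "entity": "OSPFArea"},
    {"name": "VxlanTunnelDown", "entity": "VXLANTunnel"},
]

# First chain from the text, expressed with event names only.
chain_one_names = [
    {"name": "OSPFRouterIDConflict"},
    {"name": "OSPFPeerDisConnect"},
    {"name": "OSPFPeerDisConnect"},
    {"name": "OSPFRouterIDConflict"},
]
```

With names alone, an `OSPFRouterIDConflict` alarm matches both ends of the first chain, so a single-dimension chain cannot decide its role; adding further fields to the chain entries narrows the match.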
步骤S203、以文字形式显示故障的描述信息。
其中,故障的描述信息包括:根因事件的位置信息、根因事件的描述信息和根因事件的影响范围信息。具体地,根因事件的影响范围信息包括结果事件中的至少一个结果事件的信息。至少一个结果事件的信息可以包括至少一个结果事件的位置信息和至少一个结果事件的描述信息。事件(包括根因事件和结果事件)的描述信息可以包括事件的标识和/或事件的含义。事件的标识可以是事件的名称或者ID。事件的含义可以是事件的语义描述信息。
在一个示例中,网络故障显示设备可以通过自然语言显示故障的描述信息。在一个示例中,网络故障显示设备可以使用连接词对根因事件的位置信息、根因事件的描述信息和根因事件的影响范围信息进行文字连接,以获得故障的描述信息。其中,连接词可以是动词和介词等类型的词汇。
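For example, the network fault display device may use a first verb to connect the location information of the root cause event and the description information of the root cause event, indicating that the root cause event occurred at that location. It may use a second verb to connect the description information of the root cause event and its impact scope information, indicating the impact scope of the root cause event. The first verb may be “发生” (occur), “存在” (exist), “产生” (arise), or “出现” (appear); the second verb may be “导致” (lead to), “引起” (cause), or “造成” (result in). Similarly, the device may use the first verb to connect the location information and the description information of a result event within the impact scope information. For example, assume that the location information of the root cause event is A, its description information is a, the location information of one of its result events is B, and the description information of that result event is b. The description information of the fault may then be “A发生a,导致B发生b” (a occurs at A, causing b to occur at B), or “A存在a,导致B发生b”, or “A产生a,导致B发生b”, or “A出现a,导致B发生b”, or “A发生a,引起B产生b”. It can be understood that the description information of the fault in this application is not limited to the verbs listed here.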
又例如,网络故障显示设备还可以使用第一介词连接根因事件的描述信息和根因事件的位置信息,以指示该位置处发生了该根因事件。网络故障显示设备可以使用第二动词连接根因事件的描述信息和影响范围信息。第一介词可以是“在”等。类似地,网络故障显示设备还可以使用第一动词或第一介词连接根因事件的影响范围信息中的结果事件的位置信息和描述信息。例如,假设根因事件的位置信息为A,根因事件的描述信息为a,根因事件的一个结果事件的位置信息为B,该结果事件的描述信息为b,则故障的描述信息可以为“在A的a,导致在B的b”,或者“在A的a,引起B发生b”,或者“在A的a,造成B出现b”,或者“a在A导致b在B”,或者“a在A导致B发生b”等。
以前述的RI冲突故障为例,如图4a所示,故障的描述信息具体可以是“serverleaf02_1发生hwOspfv2IntraAreaRouteridConflict,造成DC1-spine-01发生bgpBackwardTransition_active”。如图4b所示,故障的描述信息还可以是“在serverleaf02_1的hwOspfv2IntraAreaRouteridConflict,导致DC1-spine-01发生bgpBackwardTransition_active”。其中,serverleaf02_1为根因事件的位置信息,hwOspfv2IntraAreaRouteridConflict为根因事件的描述信息,DC1-spine-01为根因事件引起的结果事件的位置信息,bgpBackwardTransition_active为该结果事件的描述信息。
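The verb-connected and preposition-connected forms can be sketched as a simple template. This is a minimal sketch under assumptions: the function `describe_fault` and its parameters are hypothetical, and only the connecting words and the example strings of Figs. 4a and 4b come from the text.

```python
def describe_fault(root_loc, root_desc, result_loc, result_desc,
                   first_verb="发生", second_verb="导致",
                   preposition=None, result_verb="发生"):
    """Join location, description and impact scope with connecting words."""
    if preposition:
        # Preposition form: "在A的a"
        root_part = f"{preposition}{root_loc}的{root_desc}"
    else:
        # Verb form: "A发生a"
        root_part = f"{root_loc}{first_verb}{root_desc}"
    # The second verb links the root cause to its impact scope.
    return f"{root_part},{second_verb}{result_loc}{result_verb}{result_desc}"


fig_4a = describe_fault(
    "serverleaf02_1", "hwOspfv2IntraAreaRouteridConflict",
    "DC1-spine-01", "bgpBackwardTransition_active",
    second_verb="造成",
)
fig_4b = describe_fault(
    "serverleaf02_1", "hwOspfv2IntraAreaRouteridConflict",
    "DC1-spine-01", "bgpBackwardTransition_active",
    preposition="在",
)
```

Keeping the connecting words as parameters reflects the point that the description is not tied to one particular verb or preposition.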
在一个示例中,当根因事件引起多个结果事件时,根因事件的影响范围信息可以包括该多个结果事件的信息。例如,故障的描述信息可以为“A发生a,导致B发生b、C发生c”。其中,b和c是根因事件a引起的两个结果事件的描述信息。
在一个示例中,当根因事件引起多个结果事件时,网络故障显示设备还可以在根因事件的影响范围信息中包括该根因事件引起的结果事件的数量。例如,故障的描述信息可以为“A发生a,导致B发生b等M个异常”,或者,故障的描述信息可以为“A发生a,导致B、C等发生M个异常”。其中,M表示根因事件引起的结果事件的数量。
以前述的RI冲突故障为例,如图4c所示,故障的描述信息还可以是:“serverleaf02_1发生hwOspfv2IntraAreaRouteridConflict,导致DC1-spine-01发生bgpBackwardTransition_active等2个异常”。图4c中,根因事件引起的结果事件的数量为2。
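The count-based form “A发生a,导致B发生b等M个异常” can be sketched as below. The function `describe_with_count` is hypothetical, and the second result event in the usage example is a placeholder, because the text gives only the count of 2 for Fig. 4c and does not name the second event.

```python
def describe_with_count(root_loc, root_desc, results):
    """results: non-empty list of (location, description) pairs of result events."""
    first_loc, first_desc = results[0]
    text = f"{root_loc}发生{root_desc},导致{first_loc}发生{first_desc}"
    if len(results) > 1:
        # Append the number M of result events caused by the root cause event.
        text += f"等{len(results)}个异常"
    return text


fig_4c = describe_with_count(
    "serverleaf02_1",
    "hwOspfv2IntraAreaRouteridConflict",
    [
        ("DC1-spine-01", "bgpBackwardTransition_active"),
        ("placeholder-device", "placeholderEvent"),  # hypothetical second result event
    ],
)
```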
在一个示例中,当根因事件引起多个结果事件时,网络故障显示设备还可以从多个结果事件中选择一个结果事件作为第一事件,以在故障的描述信息中使用第一事件的位置信息和描述信息作为根因事件的影响范围信息。例如,网络故障显示设备可以选择影响程度较大的结果事件。例如,故障的描述信息“A发生a,导致B发生b”中的B和b可以替换为第一事件的位置信息和第一事件的描述信息。
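For example, the network fault display device may select, from the multiple result events, the result event closest to the application layer as the first event. The device may use the gap between the application layer and the layer that the result event's entity occupies in the communication model as the measure of how close the result event is to the application layer. For example, in the open system interconnection reference model (OSI reference model), the layers are the application layer, presentation layer, session layer, transport layer, network layer, data link layer, and physical layer. The entity to which a result event belongs can be mapped to a layer of the OSI reference model. For example, if the entity is an interface of a network device, it maps to the physical layer; if the entity is a media access control (MAC) protocol module, it maps to the data link layer; if the entity is a routing information protocol (RIP) module, it maps to the network layer; if the entity is a transmission control protocol (TCP) module, it maps to the transport layer. The smaller the gap between the entity's layer and the application layer, the closer the result event is to the application layer. For example, if the result events include result event A and result event B, the entity of result event A is at the transport layer, and the entity of result event B is at the physical layer, then result event A is closest to the application layer, and the network fault display device may select result event A as the first event.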
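The selection rule (a smaller gap to the application layer means a closer event) can be sketched as follows. The layer order is the OSI reference model; the entity-to-layer table follows the mappings given in the text, while the key names, the event dictionaries, and the function `pick_first_event` are hypothetical.

```python
# OSI reference model layers, from lowest to highest.
OSI_LAYERS = [
    "physical", "data link", "network", "transport",
    "session", "presentation", "application",
]

# Entity kinds mapped to layers as described in the text (key names are assumed).
ENTITY_LAYER = {
    "Interface": "physical",
    "MAC": "data link",
    "RIP": "network",
    "TCP": "transport",
}


def gap_to_application(event):
    """Number of layers between the event's entity and the application layer."""
    top = len(OSI_LAYERS) - 1
    return top - OSI_LAYERS.index(ENTITY_LAYER[event["entity_kind"]])


def pick_first_event(result_events):
    """Select the result event whose entity is closest to the application layer."""
    return min(result_events, key=gap_to_application)


tcp_event = {"name": "TCP connection down", "entity_kind": "TCP"}
port_event = {"name": "interface down", "entity_kind": "Interface"}
```

Under this rule a transport-layer event (gap 3) ranks ahead of a physical-layer event (gap 6), so `pick_first_event([port_event, tcp_event])` returns the TCP event.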
在一个示例中,故障的描述信息还可以只包括根因事件的位置信息、根因事件的标识和根因事件的含义。网络故障显示设备可以使用动词连接根因事件的位置信息和根因事件的标识,使用介词连接根因事件的标识和根因事件的含义。其中,根因事件的含义可以是根因事件的语义描述信息。例如,若根因事件的位置信息为A,根因事件的标识为a,根因事件的含义为X,则故障的描述信息可以是“A发生a,即X”。以RI冲突故障为例,故障的描述信息可以是“serverleaf02_1OSPF进程2域0.0.0.0发生hwOspfv2IntraAreaRouteridConflict,即OSPF在区域内检测到路由器标识冲突”。其中,“serverleaf02_1OSPF进程2域0.0.0.0”是根因事件发生的位置,“hwOspfv2IntraAreaRouteridConflict”是根因事件的标识,“OSPF在区域内检测到路由器标识冲突”是根因事件的含义。
在一个示例中,网络故障显示设备还可以更详细地显示根因事件引起的结果事件的信息。例如,网络故障显示设备对多个结果事件进行聚类,故障的描述信息还可以包括至少一个聚类组的信息。至少一个聚类组中的每个聚类组包括一个或多个结果事件。网络故障显示设备可以使用自然语言显示至少一个聚类组的信息。其中,至少一个聚类组中的每个聚类组包括的结果事件的位置信息,和,所述至少一个聚类组中的每个聚类组包括的结果事件的描述信息通过动词连接。网络故障显示设备可以按照分类条件,对多个结果事件进行分类,以将多个结果事件划分为至少一个聚类组。其中,分类条件可以包括结果事件的类型、结果事件的含义、结果事件所属实体、结果事件所属实体之间的拓扑关系、结果事件对应的层级等。其中,每个聚类组包括相似的结果事件。
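For example, suppose the root cause event causes three network devices (network device 0 to network device 2) to produce nine result events in total (result event 0 to result event 8). The network fault display device may cluster the nine result events according to a classification condition to obtain multiple cluster groups. The nine result events are:
Result event 0: “interface 1 of network device 0 is down”;
Result event 1: “routing protocol error on network device 0”;
Result event 2: “TCP connection of network device 0 is down”;
Result event 3: “interface 1 of network device 1 is down”;
Result event 4: “routing protocol error on network device 1”;
Result event 5: “interface 2 of network device 2 is down”;
Result event 6: “routing protocol error on network device 2”;
Result event 7: “RIP error on network device 2”;
Result event 8: “TCP connection of network device 2 is down”.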
例如,网络故障显示设备按照结果事件的类型对结果事件进行聚类,可以获得3个聚类组(例如,聚类组A~C)。其中,聚类组A可以对应于“接口断开”类的结果事件,包括结果事件0、结果事件3和结果事件5。其中,聚类组B可以对应于“协议错误”类结果事件,包括结果事件1、结果事件4和结果事件6。其中,聚类组C可以对应于“TCP连接断开”类结果事件,包括结果事件2和结果事件8。
例如,网络故障显示设备按照结果事件的含义进行聚类,可以将结果事件6和结果事件7聚为一个聚类组,例如,聚类组D。
例如,网络故障显示设备按照结果事件所属实体对结果事件进行聚类,可以获得3个聚类组(例如,聚类组E~G)。其中,聚类组E包括结果事件0~2,该3个结果事件均属于网络设备0。其中,聚类组F包括结果事件3~4,该2个结果事件均属于网络设备1。其中,聚类组G包括结果事件5~8,该4个结果事件均属于网络设备2。
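The network fault display device may cluster the result events according to the topological relationship between the entities to which they belong, obtaining one or more cluster groups. For example, the topological relationship may be the hop count between entities. For example, if network device 0 and network device 1 are directly connected, and network device 2 is connected to network device 0 or network device 1 through multiple other network devices, the network fault display device may obtain two cluster groups (for example, cluster groups H and I). Cluster group H includes result events 0 to 4, and cluster group I includes result events 5 to 8.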
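For example, by clustering the result events according to their corresponding layers, the network fault display device may obtain three cluster groups (for example, cluster groups J to L). Cluster group J includes result events 0, 3, and 5; these three result events correspond to the physical layer. Cluster group K includes result events 1, 4, 6, and 7; these four result events correspond to the network layer. Cluster group L includes result events 2 and 8; these two result events correspond to the transport layer.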
网络故障显示设备可以在故障的描述信息中包括上述一个或多个聚类组的信息。一个聚类组的信息可以由网络故障显示设备根据自然语言生成模型处理一个聚类组包括的结果事件的信息获得。网络故障显示设备还可以提取各个结果事件的位置信息、标识、含义等,以在每个聚类组中合并该聚类组包括的结果事件的位置信息,并用动词连接合并后的位置信息和各个结果事件的标识和/或含义。例如,故障的描述信息还可以包括如下信息:
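Three entities, network devices 0 to 2, have the following events: interface 1 of network device 0 is down; interface 1 of network device 1 is down; interface 2 of network device 2 is down.
Two entities, network devices 0 and 2, have the following event: TCP connection down.
Three entities, network devices 0 to 2, have the following event: routing protocol error.
Network device 2 has the following events: routing protocol error; RIP error.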
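Clustering by a classification condition can be sketched as grouping by a key. This is a minimal sketch under assumptions: the nine events mirror result events 0 to 8 listed above, while the field names `id`, `device`, `kind`, and `layer`, the English labels, and the function `cluster` are hypothetical. Only the entity-based and layer-based conditions are exercised here.

```python
from collections import defaultdict

# Result events 0-8 from the example; field names and labels are assumptions.
EVENTS = [
    {"id": 0, "device": "device0", "kind": "interface down", "layer": "physical"},
    {"id": 1, "device": "device0", "kind": "routing protocol error", "layer": "network"},
    {"id": 2, "device": "device0", "kind": "TCP connection down", "layer": "transport"},
    {"id": 3, "device": "device1", "kind": "interface down", "layer": "physical"},
    {"id": 4, "device": "device1", "kind": "routing protocol error", "layer": "network"},
    {"id": 5, "device": "device2", "kind": "interface down", "layer": "physical"},
    {"id": 6, "device": "device2", "kind": "routing protocol error", "layer": "network"},
    {"id": 7, "device": "device2", "kind": "RIP error", "layer": "network"},
    {"id": 8, "device": "device2", "kind": "TCP connection down", "layer": "transport"},
]


def cluster(events, condition):
    """Group event IDs by the value of one classification condition."""
    groups = defaultdict(list)
    for event in events:
        groups[event[condition]].append(event["id"])
    return dict(groups)


by_entity = cluster(EVENTS, "device")  # corresponds to cluster groups E to G
by_layer = cluster(EVENTS, "layer")    # corresponds to cluster groups J to L
```

Grouping on a single field is the simplest reading of a classification condition; conditions such as meaning or topology would need a similarity measure or hop counts rather than exact equality.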
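Taking a network RI conflict fault as an example, the description information of the fault includes the root cause location information, the root cause description information, and the root cause impact scope information (for example, “serverleaf02_1发生hwOspfv2IntraAreaRouteridConflict,造成DC1-spine-01发生bgpBackwardTransition_active”, or “在serverleaf02_1的hwOspfv2IntraAreaRouteridConflict,导致DC1-spine-01发生bgpBackwardTransition_active”). The description information of the fault may also include information about multiple cluster groups. For example, as shown in Fig. 4d, the fault description information further includes information about five cluster groups: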
“1.serverleaf02_1、serverleaf03_1等2个实体发生如下2类事件:OpenFlow连接中断;NTP时钟同步状态变化;
2.serverleaf02_1接口25GE1/0/1等4个实体发生如下4类事件:接口板的某个BFD会话的状态转变为Down;BFD会话状态由其它状态变为Down;BFD会话状态由其它状态变为Up;
3.serverleaf02_1OSPF进程2域0.0.0.0网段192.168.7.0/255.255.255.252等4个实体发生如下3类事件:OSPF邻居状态发生变化,可能是由于该邻居所在的接口状态发生变化,或者收到的Hello报文中内容发生改变;非虚连接接口上重传一个OSPF报文,可能是由于物理链路不通;
4.DC1-spine-01OSPF进程1域0.0.0.0实体发生如下1类事件:OSPF在区域内检测到路由器标识冲突;
5.DC1-spine-01等2个实体发生如下1类事件:BGP状态机的状态值从高值状态变为低值状态并且前一个状态是Openconfirm状态或Established状态。”
其中,第一个聚类组包括以结果事件所属实体之间的拓扑关系为分类条件进行聚类获得的结果事件。其中,“serverleaf02_1”和“serverleaf03_1”可以是两个直接相连的网络设备。
其中,第二个聚类组和第三个聚类组包括以结果事件的含义为分类条件进行聚类获得的结果事件。其中,第二个聚类组包括双向转发检测(Bidirectional Forwarding Detection,BFD)状态变化相关的结果事件,第三个聚类组包括开放最短路径优先(Open Shortest Path First,OSPF)协议相关的结果事件。
其中,第四个聚类组包括以结果事件所属实体为分类条件进行聚类获得的结果事件,第五个聚类组包括以结果事件的类型为分类条件进行聚类获得的结果事件。
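The circles, boxes, and arrows in Figs. 4a to 4d explicitly mark the location information, description information, and impact scope information of the root cause event contained in the fault description information, as well as the connecting words between these pieces of information, to help readers understand this application. It can be understood that these circles, boxes, and arrows are not a required part of the fault description information.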
图2所示的方法实施例,网络故障显示设备可以通过自身的显示界面将故障的描述信息呈现给用户。该故障描述信息包括根因事件的位置信息、描述信息和该根因事件的影响范围信息。用户无需熟悉运维工具操作手册和各种告警事件的含义,即可通过显示界面直观地看到某个设备处发生的异常事件引发了网络故障以及其他的告警事件。网络故障显示设备还对结果事件进行分类,使得多个告警事件的呈现更清晰。该方案降低了故障运维的难度,提高了故障运维的效率。
图5是本申请实施例提供的另一种网络故障显示方法的流程图。
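As shown in Fig. 2, the network fault display device may display the fault description information on its own screen. For example, the network fault display device may be a laptop that displays the fault description information on its built-in screen. As shown in Fig. 5, the network fault display device may also display the fault description information in other ways. For example, the device sends the fault description information over a cable to a display, which shows it directly. The cable may be of various types, for example, a High-Definition Multimedia Interface (HDMI) cable, a Video Graphics Array (VGA) cable, a Digital Visual Interface (DVI) cable, or a DisplayPort (DP) cable. As another example, the device sends the fault description information over a network to another device, such as a mobile phone or personal computer, and that device displays it. The network may be of various types, for example, a wireless local area network, Ethernet, or a cellular network. For example, a fault maintenance device in a maintenance center displays the fault description information and also sends it to a maintenance engineer's office device, such as a mobile phone, tablet, or personal computer, so that the engineer can view it wherever they are working.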
如图5所示,该方法包括如下的步骤S500-步骤S505。
步骤S500、网络设备向网络故障显示设备发送告警/日志。
本实施例中,网络设备可以向网络故障显示设备主动发送告警/日志,网络故障显示设备也可以主动从网络设备获取告警/日志。
步骤S501、网络故障显示设备获取网络中多个告警事件的信息。
步骤S502、网络故障显示设备根据多个告警事件的信息确定网络的故障。
本实施例中,步骤S501和步骤S502的具体介绍可以参见前述方法实施例中对步骤S201和步骤S202的描述,此处不再赘述。
步骤S503、网络故障显示设备发送故障的描述信息。
本实施例中,网络故障显示设备可将所述故障的描述信息发送给终端设备。
步骤S504、终端设备接收故障的描述信息。
步骤S505、终端设备以文字形式显示故障的描述信息。
终端设备接收到的故障描述信息可以是以自然语言表示的故障描述信息,例如,如图4a或图4d所示的故障描述信息。此时,终端设备可以直接显示该故障描述信息。
终端设备接收到的故障描述信息也可以仅包括故障描述信息的关键内容,例如,根因事件的位置信息、根因事件的标识、根因事件的影响范围信息等。终端设备可以组装关键内容,然后再以文字的形式显示连接好的故障描述信息。例如,终端设备使用动词或介词以连接根因事件的位置信息和根因事件的标识,使用动词或介词连接根因事件的标识和根因事件的影响范围信息。
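Assembly on the terminal side can be sketched as follows. The text does not specify a message format, so the JSON payload, its keys `root_location`, `root_id`, and `impact`, and the function `assemble` are all assumptions made for illustration.

```python
import json


def assemble(payload_json, first_verb="发生", second_verb="导致"):
    """Build the displayed sentence from key content received by the terminal."""
    payload = json.loads(payload_json)
    # Root cause part: location connected to identifier by the first verb.
    root_part = f"{payload['root_location']}{first_verb}{payload['root_id']}"
    # Impact scope part: one clause per result event, joined as in "B发生b、C发生c".
    impact_part = "、".join(
        f"{item['location']}{first_verb}{item['id']}" for item in payload["impact"]
    )
    return f"{root_part},{second_verb}{impact_part}"


# Hypothetical payload carrying only the key content.
example_payload = json.dumps({
    "root_location": "serverleaf02_1",
    "root_id": "hwOspfv2IntraAreaRouteridConflict",
    "impact": [{"location": "DC1-spine-01", "id": "bgpBackwardTransition_active"}],
})
```

Sending only the key content lets each terminal choose its own connecting words, at the cost of duplicating the assembly logic outside the network fault display device.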
本实施例中,终端设备接收到故障的描述信息时,可以通过终端设备的显示界面,以文字形式显示所述故障的描述信息,以将故障的描述信息呈现给用户。具体地,终端设备以文字形式显示描述信息的过程与网络故障显示设备以文字形式显示描述信息的过程相同,具体可参见前述图2所示方法实施例中步骤S203的介绍,此处不再赘述。
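Based on the method embodiments shown in Fig. 2 and Fig. 5, an embodiment of this application further provides a network device. The network device implements step S200 shown in Fig. 2 or step S500 shown in Fig. 5, that is, it sends alarms and/or logs to the network fault display device. Specifically, the network device may be configured to send alarms and/or logs to the network fault display device at a preset interval, or to send them upon receiving a collection instruction from the network fault display device.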
基于上述图5所示的方法实施例,本申请实施例还提供一种终端设备,该终端设备实现图5所示的步骤S504和步骤S505。具体地,终端设备用于接收网络故障显示设备发送的故障描述信息,以及通过自身的显示界面显示所述故障的描述信息。
图6是本申请实施例提供的一种网络故障显示设备600的结构示意图。该网络故障显示设备600用于实现图2中的步骤S201-步骤S203。如图6所示,该网络故障显示设备600包括:获取模块601、确定模块602和显示模块603。
其中,获取模块601用于获取网络中多个告警事件的信息。
其中,确定模块602用于根据多个告警事件的信息确定网络的故障。
其中,显示模块603用于通过显示界面,以文字的形式显示所述故障的描述信息。
需要说明的是,图6所示实施例提供的网络故障显示设备600在执行网络故障显示方法时,仅以上述各功能模块的划分举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将设备的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的网络故障显示设备与图2所示的网络故障显示方法实施例属于同一构思,其具体实现过程详见方法实施例,这里不再赘述。
图7是本申请实施例提供的另一种网络故障显示设备700的结构示意图。
该网络故障显示设备700用于实现上述图5中的步骤S501-步骤S503。如图7所示,该网络故障显示设备700包括:获取模块701、确定模块702和发送模块703。
其中,获取模块701用于获取网络中多个告警事件的信息。
其中,确定模块702用于根据多个告警事件的信息获得网络的故障的描述信息。
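The sending module 703 is configured to send the description information of the fault. Another device, for example, a terminal device, receives the description information of the fault and displays it in text form on its display interface.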
需要说明的是,图7所示实施例提供的网络故障显示设备700在执行网络故障显示方法时,仅以上述各功能模块的划分举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将设备的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的网络故障显示设备与图5所示的网络故障显示方法实施例属于同一构思,其具体实现过程详见方法实施例,这里不再赘述。
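Fig. 8 is a schematic diagram of the hardware structure of a network device 800 provided by an embodiment of this application.
The network device 800 may be the network fault display device or the terminal device described above. Referring to Fig. 8, the network device 800 includes a processor 810, a memory 820, a communication interface 830, and a bus 840; the processor 810, the memory 820, and the communication interface 830 are connected to one another through the bus 840. They may also be connected in ways other than the bus 840.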
其中,存储器820可以是各种类型的存储介质,例如随机存取存储器(random access memory,RAM)、只读存储器(read-only memory,ROM)、非易失性RAM(non-volatile RAM,NVRAM)、可编程ROM(programmable ROM,PROM)、可擦除PROM(erasable PROM,EPROM)、电可擦除PROM(electrically erasable PROM,EEPROM)、闪存、光存储器、硬盘等。
其中,处理器810可以是通用处理器,通用处理器可以是通过读取并执行存储器(例如存储器820)中存储的内容来执行特定步骤和/或操作的处理器。例如,通用处理器可以是中央处理器(central processing unit,CPU)。处理器810可以包括至少一个电路,以执行图2或者图5所示实施例提供的网络故障显示方法的全部或部分步骤。
其中,通信接口830包括输入/输出(input/output,I/O)接口、物理接口和逻辑接口等用于实现网络设备800内部的器件互连的接口,以及用于实现网络设备800与其他设备(例如其他网络设备或用户设备)互连的接口。物理接口可以是以太网接口,光纤接口,ATM接口等。
其中,总线840可以是任何类型的,用于实现处理器810、存储器820和通信接口830互连的通信总线,例如系统总线。
上述器件可以分别设置在彼此独立的芯片上,也可以至少部分的或者全部的设置在同一块芯片上。将各个器件独立设置在不同的芯片上,还是整合设置在一个或者多个芯片上,往往取决于产品设计的需要。本申请实施例对上述器件的具体实现形式不做限定。
图8所示的网络设备800仅仅是示例性的,在实现过程中,网络设备800还可以包括其他组件,本文不再一一列举。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机程序指令时,全部或部分地产生按照本发明实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线(DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质(例如,固态硬盘(solid state disk,SSD))等。
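It can be understood that the numerals used in the embodiments of this application are only for ease of description and do not limit the scope of the embodiments. It should be understood that, in the embodiments of this application, the sequence numbers of the processes above do not imply an order of execution. The order of execution of the processes should be determined by their functions and internal logic, and the sequence numbers should not limit the implementation of the embodiments in any way.
The specific implementations described above further explain the objectives, technical solutions, and beneficial effects of this application. It should be understood that they are merely specific implementations of this application and are not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made on the basis of the technical solutions of this application shall fall within the scope of protection of this application.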

Claims (21)

  1. 一种网络故障显示方法,其特征在于,所述方法包括:
    获取网络中的多个告警事件的信息;
    根据所述多个告警事件的信息确定所述网络的故障;
    以文字的形式显示所述故障的描述信息;
    其中,所述故障的描述信息包括根因事件的位置信息、所述根因事件的描述信息和所述根因事件的影响范围信息,所述根因事件是所述多个告警事件中的一个告警事件,所述多个告警事件中的其他告警事件为所述根因事件的结果事件,所述根因事件的影响范围信息包括所述结果事件中的至少一个结果事件的信息。
  2. 根据权利要求1所述的方法,其特征在于,所述至少一个结果事件的信息包括所述至少一个结果事件的位置信息和所述至少一个结果事件的描述信息。
  3. 根据权利要求1或2所述的方法,其特征在于,所述以文字的形式显示所述故障的描述信息包括:使用自然语言显示所述故障的描述信息,其中,
    所述根因事件的位置信息和所述根因事件的描述信息通过动词连接,所述根因事件的描述信息和所述根因事件的影响范围信息通过动词连接;或者
    所述根因事件的描述信息和所述根因事件的位置信息通过介词连接,所述根因事件的描述信息和所述根因事件的影响范围信息通过动词连接。
  4. 根据权利要求1至3任一所述的方法,其特征在于,所述根因事件的描述信息包括所述根因事件的标识和/或所述根因事件的含义。
  5. 根据权利要求1至4任一所述的方法,其特征在于,所述故障的描述信息还包括至少一个聚类组的信息,所述至少一个聚类组中的每个聚类组包括所述结果事件中的一个或多个结果事件。
  6. 根据权利要求5所述的方法,其特征在于,所述方法包括:
    使用自然语言显示所述至少一个聚类组的信息,其中,所述至少一个聚类组中的每个聚类组包括的结果事件的位置信息,和,所述至少一个聚类组中的每个聚类组包括的结果事件的描述信息通过动词连接。
  7. 根据权利要求5或6所述的方法,其特征在于,所述方法包括:
    根据分类条件对所述结果事件进行聚类以获取所述至少一个聚类组;
    其中,所述分类条件包括以下一种或多种:所述结果事件的类型、所述结果事件的含义、所述结果事件所属实体、所述结果事件所属实体之间的拓扑关系、所述结果事件对应的层级。
  8. 根据权利要求1至7任一所述的方法,其特征在于,所述根据所述多个告警事件的信息确定所述网络的故障包括:
    根据所述多个告警事件的信息和故障定位证据链确定所述根因事件,其中,所述故障定位证据链指示所述多个告警事件之间的因果关系。
  9. 根据权利要求1至8任一所述的方法,其特征在于,所述方法还包括:发送所述故障的描述信息。
  10. 一种网络故障显示设备,其特征在于,所述网络故障显示设备包括:
    获取模块,用于获取网络中的多个告警事件的信息;
    确定模块,用于根据所述多个告警事件的信息确定所述网络的故障;
    显示模块,用于以文字形式显示所述故障的描述信息;
    其中,所述故障的描述信息包括根因事件的位置信息、所述根因事件的描述信息和所述根因事件的影响范围信息,所述根因事件是所述多个告警事件中的一个告警事件,所述多个告警事件中的其他告警事件为所述根因事件的结果事件,所述根因事件的影响范围信息包括所述结果事件中的至少一个结果事件的信息。
  11. 根据权利要求10所述的设备,其特征在于,所述至少一个结果事件的信息包括所述至少一个结果事件的位置信息和所述至少一个结果事件的描述信息。
  12. 根据权利要求10或11所述的设备,其特征在于,
    所述显示模块,用于使用自然语言显示所述故障的描述信息,其中,
    所述根因事件的位置信息和所述根因事件的描述信息通过动词连接,所述根因事件的描述信息和所述根因事件的影响范围信息通过动词连接;或者
    所述根因事件的描述信息和所述根因事件的位置信息通过介词连接,所述根因事件的描述信息和所述根因事件的影响范围信息通过动词连接。
  13. 根据权利要求10至12任一所述的设备,其特征在于,所述根因事件的描述信息包括所述根因事件的标识和/或所述根因事件的含义。
  14. 根据权利要求10至13任一所述的设备,其特征在于,所述故障的描述信息还包括至少一个聚类组的信息,所述至少一个聚类组中的每个聚类组包括所述结果事件中的一个或多个结果事件。
  15. 根据权利要求14所述的设备,其特征在于,
    所述显示模块,还用于使用自然语言显示所述至少一个聚类组的信息,其中,所述至少一个聚类组中的每个聚类组包括的结果事件的位置信息,和,所述至少一个聚类组中的每个聚类组包括的结果事件的描述信息通过动词连接。
  16. 根据权利要求14或15所述的设备,其特征在于,
    所述显示模块,用于基于分类条件对所述结果事件进行聚类以获取所述至少一个聚类组;
    其中,所述分类条件包括以下一种或多种:所述结果事件的类型、所述结果事件的含义、所述结果事件所属实体、所述结果事件所属实体之间的拓扑关系、所述结果事件对应的层级。
  17. 根据权利要求10至16任一所述的设备,其特征在于,所述确定模块根据所述多个告警事件的信息确定所述网络的故障包括:
    所述确定模块根据所述多个告警事件的信息和故障定位证据链确定所述根因事件,其中,所述故障定位证据链指示所述多个告警事件之间的因果关系。
  18. 根据权利要求10至17任一所述的设备,其特征在于,所述网络故障显示设备还包括发送模块,
    所述发送模块,用于发送所述故障的描述信息。
  19. 一种网络故障显示设备,其特征在于,所述网络故障显示设备包括:处理器和存储器,所述处理器用于执行存储于所述存储器内的计算机程序以实现权利要求1至9任一所述的方法。
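  20. A computer-readable storage medium, characterized by comprising instructions which, when run on a computer, cause the computer to perform the method according to any one of claims 1 to 9.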
  21. 一种计算机程序产品,其特征在于,包括程序代码,当计算机运行所述计算机程序产品时,使得所述计算机执行如权利要求1至9任一所述的方法。
PCT/CN2022/115069 2021-08-31 2022-08-26 一种网络故障显示方法及设备 WO2023030183A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111016800.9 2021-08-31
CN202111016800.9A CN115733725A (zh) 2021-08-31 2021-08-31 一种网络故障显示方法及设备

Publications (1)

Publication Number Publication Date
WO2023030183A1 true WO2023030183A1 (zh) 2023-03-09

Family

ID=85291772

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/115069 WO2023030183A1 (zh) 2021-08-31 2022-08-26 一种网络故障显示方法及设备

Country Status (2)

Country Link
CN (1) CN115733725A (zh)
WO (1) WO2023030183A1 (zh)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112416645A (zh) * 2020-12-03 2021-02-26 广州云岫信息科技有限公司 一种基于人工智能的故障根因推断定位方法及装置

Also Published As

Publication number Publication date
CN115733725A (zh) 2023-03-03

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22863321

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE