WO2022100108A1 - Method, apparatus and system for handling faults - Google Patents

Method, apparatus and system for handling faults

Info

Publication number
WO2022100108A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
fault
relationship
type
abnormal
Application number
PCT/CN2021/103583
Inventor
陈贤松
Original Assignee
华为技术有限公司
Application filed by 华为技术有限公司
Priority to EP21890639.4A (EP4181475A4)
Publication of WO2022100108A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H04L41/064 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis involving time analysis
    • H04L41/065 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis involving logical or physical relationship, e.g. grouping and hierarchies
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0677 Localisation of faults
    • H04L41/069 Management of faults, events, alarms or notifications using logs of notifications; Post-processing of notifications
    • H04L41/08 Configuration management of networks or network elements
    • H04L41/0894 Policy-based network configuration management
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/06 Generation of reports
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/0823 Errors, e.g. transmission errors
    • H04L43/16 Threshold monitoring

Definitions

  • The present application relates to the field of communications, and in particular to a method, apparatus and system for handling faults.
  • Communication networks often include multiple network elements, and network elements may fail during operation. Therefore, troubleshooting is an important task in the operation and maintenance of communication networks.
  • An element management system (EMS) can be deployed in the communication network, and the EMS can communicate with at least one network element.
  • A technician can manually configure fault handling rules for the EMS, so that the EMS can use those rules to handle faults that occur in the at least one network element.
  • However, the fault handling rules can only be configured manually, EMS by EMS, based on experience.
  • The fault handling rules configured on each EMS cannot be learned or updated, so the configuration of fault handling rules on the EMS is not real-time.
  • The present application provides a method, apparatus and system for handling faults, so as to improve the efficiency and real-time performance of configuring fault handling rules.
  • The technical solution is as follows:
  • In a first aspect, the present application provides a method for handling faults.
  • In the method, an operation support system (OSS) receives fault information sent by at least one element management system (EMS). The fault information includes a type information set, a first relationship set, root cause alarm information of the fault, processing suggestion information for the fault, and time information of multiple abnormal events caused by the fault. The type information set includes the type information of the multiple abnormal events, and the first relationship set includes the relationships between the multiple abnormal events.
  • The OSS aggregates the pieces of fault information that include the same type information set, root cause alarm information, and processing suggestion information into a fault information set. It obtains a second relationship set by taking the intersection of the first relationship sets in the pieces of fault information in the fault information set, and determines a time length from the time information in each piece of fault information, where the time length is the duration over which the fault causes abnormal events.
  • The OSS sends a fault handling rule to at least one EMS. The fault handling rule includes the second relationship set, the time length, and the type information set, root cause alarm information, and processing suggestion information of first fault information, where the first fault information is any piece of fault information in the fault information set. The fault handling rule is used to instruct the at least one EMS to handle the fault.
  • The OSS obtains the second relationship set by taking the intersection of the first relationship sets included in the pieces of fault information in the fault information set, and determines the time length based on the time information included in each piece of fault information in the set, so that a fault handling rule including the second relationship set and the time length can be obtained.
  • Because the OSS can receive fault information sent by different EMSs and generate fault handling rules based on the fault information of each EMS, the data available for generating fault handling rules is enriched. This makes the generated fault handling rules learnable and updatable, and also improves the efficiency of generating them. The OSS sends the fault handling rules to at least one EMS, which improves the efficiency and real-time performance of configuring fault handling rules on the EMS.
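  • The aggregation performed by the OSS can be sketched as follows. This is a minimal illustration, not the implementation described in the application: the class names `FaultInfo` and `FaultRule`, the function `build_rules`, and the representation of a relationship as a tuple are all assumptions made for the example, and the time information is assumed to arrive already as a time span in seconds.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultInfo:
    """One piece of fault information reported by an EMS (hypothetical field names)."""
    type_set: frozenset    # type information of the abnormal events
    relations: frozenset   # first relationship set, here tuples (a, b, relation)
    root_cause: str        # root cause alarm information
    suggestion: str        # processing suggestion information
    time_span: float       # time span of the events' generation times, in seconds


@dataclass(frozen=True)
class FaultRule:
    """A fault handling rule derived by the OSS (hypothetical field names)."""
    type_set: frozenset
    relations: frozenset   # second relationship set
    root_cause: str
    suggestion: str
    time_length: float


def build_rules(reports):
    """Group reports that share type set, root cause and suggestion; derive one rule per group."""
    groups = defaultdict(list)
    for info in reports:
        groups[(info.type_set, info.root_cause, info.suggestion)].append(info)

    rules = []
    for (type_set, root_cause, suggestion), infos in groups.items():
        # Second relationship set: intersection of the first relationship sets.
        common = frozenset.intersection(*(i.relations for i in infos))
        # Time length: the largest reported time span in the group.
        time_length = max(i.time_span for i in infos)
        rules.append(FaultRule(type_set, common, root_cause, suggestion, time_length))
    return rules
```

  • Grouping on the triple of type information set, root cause alarm information and processing suggestion information mirrors the aggregation condition stated above; any of those three values can then be copied from an arbitrary member of the group.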
  • The relationship between the multiple abnormal events is represented by at least one piece of type relationship data, and each piece of type relationship data includes the type information of two of the multiple abnormal events and the relationship between the two abnormal events.
  • The fault information further includes object information of the object where each of the multiple abnormal events is located. The relationship between the multiple abnormal events is represented by at least one piece of object relationship data.
  • Each piece of object relationship data includes the object information of two related objects and the relationship between them, and the objects where the abnormal events are located include the two objects.
  • The OSS obtains the type relationship data set corresponding to the fault information according to the at least one piece of object relationship data and the object information and type information of each abnormal event. The type relationship data set includes at least one piece of type relationship data, each of which includes the type information of two abnormal events and the relationship between them.
  • The OSS obtains the second relationship set by taking the intersection of the type relationship data sets corresponding to the pieces of fault information in the fault information set.
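  • The text does not spell out how object relationship data is turned into type relationship data. One plausible reading, shown here purely as an assumption, is that a relationship between two objects is carried over to every pair of abnormal event types located on those objects. The function name `to_type_relations` and the tuple layouts are hypothetical.

```python
def to_type_relations(object_relations, events):
    """Map object relationship data onto type relationship data.

    object_relations: iterable of (object_a, object_b, relation) tuples.
    events: iterable of (event_type, object_id) pairs, one per abnormal event.
    Returns a set of (type_a, type_b, relation) tuples.
    """
    # Index the abnormal event types by the object on which they occurred.
    types_at = {}
    for event_type, obj in events:
        types_at.setdefault(obj, set()).add(event_type)

    result = set()
    for obj_a, obj_b, relation in object_relations:
        # Assumption: the object-level relation applies to each pair of event types.
        for type_a in types_at.get(obj_a, ()):
            for type_b in types_at.get(obj_b, ()):
                result.add((type_a, type_b, relation))
    return result
```

  • Because the result no longer mentions concrete objects, type relationship data sets derived from different EMSs can be intersected directly even though the underlying ports, links and boards differ.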
  • Because the fault information includes the at least one piece of object relationship data and the object information and type information of each abnormal event, the OSS can not only generate fault handling rules from them but also implement other applications based on the object relationship data and the object information of each abnormal event. For example, the OSS may generate an object topology map based on the at least one piece of object relationship data, or determine an object that needs maintenance based on the at least one piece of object relationship data and the object information of each abnormal event.
  • The time information of the multiple abnormal events is the time span of the generation times of the multiple abnormal events.
  • The OSS selects the largest of the time spans in the pieces of fault information as the time length. Because the time information in the fault information is a single time span, the data volume of the fault information is reduced. In addition, because the OSS selects the largest time span as the time length, the time length can cover all abnormal events caused by the fault.
  • The time information of the multiple abnormal events is the generation time of each of the multiple abnormal events.
  • The OSS obtains, for each piece of fault information, the time span of the generation times of the corresponding abnormal events.
  • The OSS selects the largest of the obtained time spans as the time length. Because the fault information includes the generation time of each abnormal event, the OSS can not only obtain the time length for the fault handling rule but also implement other applications based on the generation times of the abnormal events in each piece of fault information.
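  • When the fault information carries per-event generation times, the time length follows from two steps: a span per piece of fault information, then the maximum over all spans. A minimal sketch, assuming generation times are given as numeric timestamps in seconds and using the hypothetical names `time_span` and `time_length`:

```python
def time_span(generation_times):
    """Span between the earliest and latest abnormal event of one fault."""
    return max(generation_times) - min(generation_times)


def time_length(per_fault_generation_times):
    """Largest span across all pieces of fault information in the set."""
    return max(time_span(times) for times in per_fault_generation_times)
```

  • For example, `time_length([[0, 10, 25], [100, 140]])` evaluates to `40`, since the two spans are 25 and 40.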
  • For example, the generation times of the abnormal events in the fault information can be displayed for operation and maintenance personnel to view, or, based on the generation times of the abnormal events in the fault information, the EMS can determine the process of root cause location and/or determine the object that needs to be repaired.
  • The types of the multiple abnormal events include an alarm type, a performance limit violation type, and/or a network element abnormal log type.
  • The OSS generates an object topology map based on the at least one piece of object relationship data, and displays the object topology map for operation and maintenance personnel to view.
  • In a second aspect, the present application provides a method for handling faults.
  • In the method, an element management system (EMS) acquires information about multiple first abnormal events that are caused by a first fault and reported by at least one network element managed by the EMS, and obtains root cause alarm information and processing suggestion information of the first fault. The information about each first abnormal event includes the type information and generation time of the event and the object information of the object where the event is located.
  • The EMS acquires a first relationship set based on the object information of each first abnormal event, where the first relationship set includes the relationships between the multiple first abnormal events.
  • The EMS sends first fault information to the operation support system (OSS). The first fault information includes a type information set, first time information, the first relationship set, the root cause alarm information, and the processing suggestion information. The type information set includes the type information of each first abnormal event, and the first time information includes either the generation times of the multiple first abnormal events or the time span of those generation times. The first fault information is used by the OSS to generate a first fault handling rule, which is used to instruct at least one EMS that receives it to handle the first fault.
  • The first fault handling rule includes a second relationship set, a first time length, and the type information set, root cause alarm information, and processing suggestion information of any piece of first fault information in a first fault information set.
  • The first fault information set includes multiple pieces of first fault information received by the OSS, and each of these pieces of first fault information includes the same type information set, root cause alarm information, and processing suggestion information.
  • The second relationship set is the intersection of the first relationship sets in the multiple pieces of first fault information. The first time length is obtained by the OSS based on the first time information in the multiple pieces of first fault information, and is the duration over which the first fault causes first abnormal events.
  • The OSS obtains the second relationship set by taking the intersection of the first relationship sets included in the pieces of first fault information in the first fault information set, and determines the first time length based on the time information included in each piece of first fault information in the set, so that a fault handling rule including the second relationship set and the first time length can be obtained. Because the OSS can receive first fault information sent by different EMSs and generate fault handling rules based on the first fault information of each EMS, the data available for generating fault handling rules is enriched, the generated fault handling rules are learnable and updatable, and the efficiency of generating them is improved. The OSS sends the fault handling rules to the EMS, which improves the efficiency and real-time performance of configuring fault handling rules on the EMS.
  • The relationship between the multiple first abnormal events is represented by at least one piece of type relationship data, and each piece of type relationship data includes the type information of two of the multiple first abnormal events and the relationship between them.
  • The EMS acquires the first relationship set based on the network topology map and/or the object information and type information of each first abnormal event. Because the relationships between the abnormal events in the first relationship set acquired by the EMS are already represented as type relationship data, the OSS can directly take the intersection of the first relationship sets in the pieces of first fault information in the first fault information set, which improves the efficiency of generating fault handling rules.
  • The relationship between the multiple first abnormal events is represented by at least one piece of object relationship data. Each piece of object relationship data includes the object information of two related objects and the relationship between them, and the objects where the multiple first abnormal events are located include the two objects. The EMS includes a network topology map.
  • The EMS acquires the first relationship set based on the network topology map and/or the object information of each first abnormal event. Because the first relationship set acquired by the EMS includes at least one piece of object relationship data, the OSS can use the first relationship set not only to generate fault handling rules but also to implement other applications. For example, the OSS may generate an object topology map based on the first relationship set, or determine an object that needs to be repaired based on the first relationship set.
  • The first fault information further includes the object information of each first abnormal event.
  • In this way, the OSS can generate a fault handling rule by using the first relationship set and the object information and type information of each abnormal event.
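  • How the EMS extracts object relationship data from its network topology map is not detailed in the text. As a sketch under stated assumptions, the topology map can be modelled as a list of labelled edges between objects, and the first relationship set as the edges whose two endpoints both host an abnormal event. The function name `first_relationship_set`, the edge layout and the object identifiers are hypothetical.

```python
def first_relationship_set(topology_edges, event_objects):
    """Select the topology edges that connect two objects affected by the fault.

    topology_edges: iterable of (object_a, object_b, relation) tuples.
    event_objects: iterable of object identifiers where abnormal events occurred.
    Returns a set of (object_a, object_b, relation) tuples.
    """
    affected = set(event_objects)
    return {
        (obj_a, obj_b, relation)
        for obj_a, obj_b, relation in topology_edges
        if obj_a in affected and obj_b in affected
    }
```

  • Restricting the result to affected objects keeps the reported relationship set small while still satisfying the condition that the objects where the abnormal events are located include both objects of every relationship.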
  • The types of the multiple first abnormal events include an alarm type, a performance limit violation type, and/or a network element abnormal log type.
  • The EMS receives the first fault handling rule and, based on it, acquires information about multiple second abnormal events that are caused by a second fault and reported by the at least one network element. The first fault and the second fault are faults of the same type, and the information about each second abnormal event includes the type information and generation time of the event and the object information of the object where the event is located.
  • The EMS acquires a third relationship set based on the object information of each second abnormal event, where the third relationship set includes the relationships between the multiple second abnormal events.
  • The EMS sends second fault information to the OSS. The second fault information includes second time information, the third relationship set, the type information set, the root cause alarm information, and the processing suggestion information. The second time information includes the generation times of the multiple second abnormal events or the time span of those generation times.
  • The OSS can generate a second fault handling rule based on the second fault information.
  • The second fault handling rule includes a fourth relationship set, a second time length, the type information set, the root cause alarm information, and the processing suggestion information.
  • The fourth relationship set is the intersection of the third relationship sets in multiple pieces of second fault information received by the OSS, all of which include the type information set, the root cause alarm information, and the processing suggestion information.
  • The second time length is obtained by the OSS based on the second time information in the multiple pieces of second fault information, and is the duration over which the second fault causes second abnormal events.
  • The EMS receives the second fault handling rule and replaces the first fault handling rule with the second fault handling rule.
  • Because the OSS can generate a fault handling rule based on the second fault information of different EMSs and send it to the EMS, the EMS can update its fault handling rule in time. The fault handling rule therefore stays up to date, and the efficiency of updating it is improved.
  • In a third aspect, the present application provides an apparatus for handling faults, configured to perform the method performed by the OSS in the first aspect or any possible implementation of the first aspect.
  • The apparatus includes units for performing the method performed by the OSS in the first aspect or any possible implementation of the first aspect.
  • In a fourth aspect, the present application provides an apparatus for handling faults, configured to perform the method performed by the EMS in the second aspect or any possible implementation of the second aspect.
  • The apparatus includes units for performing the method performed by the EMS in the second aspect or any possible implementation of the second aspect.
  • In a fifth aspect, the present application provides an apparatus for handling faults, the apparatus including a transceiver, a processor and a memory.
  • The transceiver, the processor and the memory may be connected through an internal connection.
  • The memory is used to store a program, and the processor is used to execute the program in the memory and cooperate with the transceiver, so that the apparatus performs the method performed by the OSS in the first aspect or any possible implementation of the first aspect.
  • In a sixth aspect, the present application provides an apparatus for handling faults, the apparatus including a transceiver, a processor and a memory.
  • The transceiver, the processor and the memory may be connected through an internal connection.
  • The memory is used to store a program, and the processor is used to execute the program in the memory and cooperate with the transceiver, so that the apparatus performs the method performed by the EMS in the second aspect or any possible implementation of the second aspect.
  • The present application provides a computer program product. The computer program product includes a computer program stored in a computer-readable storage medium, and the computer program is loaded by a device to implement the instructions of the method of the first aspect, the second aspect, any possible implementation of the first aspect, or any possible implementation of the second aspect.
  • The present application provides a computer-readable storage medium for storing a computer program, and the computer program is loaded by a device to execute the instructions of the method of the first aspect, the second aspect, any possible implementation of the first aspect, or any possible implementation of the second aspect.
  • The present application provides a system for handling faults. The system includes the apparatus described in the third aspect and the apparatus described in the fourth aspect, or the system includes the apparatus described in the fifth aspect and the apparatus described in the sixth aspect.
  • FIG. 1 is a schematic diagram of a network architecture provided by an embodiment of the present application.
  • FIG. 2 is a flowchart of a method for handling faults provided by an embodiment of the present application.
  • FIG. 3 is a flowchart of another method for handling faults provided by an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of an apparatus for handling faults provided by an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of another apparatus for handling faults provided by an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of another apparatus for handling faults provided by an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of another apparatus for handling faults provided by an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a system for handling faults provided by an embodiment of the present application.
  • An embodiment of the present application provides a network architecture, including:
  • An operations support system (OSS), multiple EMSs and multiple network elements, each EMS communicating with the OSS, and each EMS communicating with one or more network elements.
  • The OSS is configured to generate at least one fault handling rule and send the at least one fault handling rule to the EMSs that communicate with the OSS.
  • A fault handling rule is used for handling faults generated by network elements, and the faults handled by different fault handling rules may be different.
  • A fault handling rule includes a type information set, a first type relationship data set, root cause alarm information, processing suggestion information, and a time length.
  • The type information set includes multiple pieces of type information, which are the type information of the multiple abnormal events caused by the fault.
  • The time span of the generation times of the abnormal events caused by the fault is less than or equal to the time length.
  • The first type relationship data set includes at least one piece of type relationship data, and each piece of type relationship data includes two pieces of type information from the type information set and the relationship between them.
  • The root cause alarm information includes the type information of the root cause alarm event.
  • The root cause alarm event is an abnormal event directly caused by the fault, and the root cause alarm information can be used to reflect the cause of the fault.
  • The processing suggestion information is used to instruct the EMS or the OSS how to handle the fault.
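  • The rule structure described above can be written down as a small record, together with a check of whether a group of observed abnormal events is consistent with a rule. This is a sketch only: the field names and the function `is_consistent` are hypothetical, and the matching criterion used here (the observed event types equal the rule's type information set, and their generation times span no more than the time length) is an assumption, since the text states only the time-span condition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultHandlingRule:
    """Fields of a fault handling rule as listed in the text (names are illustrative)."""
    type_set: frozenset        # type information of the abnormal events caused by the fault
    type_relations: frozenset  # first type relationship data set: (type_a, type_b, relation)
    root_cause: str            # type information of the root cause alarm event
    suggestion: str            # processing suggestion information
    time_length: float         # upper bound on the events' time span, in seconds


def is_consistent(rule, events):
    """events: list of (event_type, generation_time) pairs observed by an EMS."""
    if not events:
        return False
    types = {event_type for event_type, _ in events}
    times = [generation_time for _, generation_time in events]
    # Assumed criterion: same set of types, and all events within the time length.
    return types == rule.type_set and max(times) - min(times) <= rule.time_length
```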
  • For each EMS and the one or more network elements in communication with it, the EMS is configured to handle faults of the one or more network elements according to the at least one fault handling rule, so that the faulty network element returns to normal.
  • A network element may generate an abnormal event when a fault occurs, and other network elements associated with that network element may also be affected by the fault and generate abnormal events.
  • For example, the service path between two terminals passes through multiple network elements, and the network elements that the service path passes through are related to each other.
  • If one of these network elements fails and generates an abnormal event, the other network elements may also be affected and may also generate abnormal events.
  • The faulty network element is referred to as the first network element.
  • When the first network element fails, it generates and reports at least one abnormal event.
  • The at least one abnormal event is directly or indirectly caused by the fault.
  • The fault of the first network element may be a fault of a certain component in the first network element.
  • The abnormal events generated by the first network element include at least one alarm event. The at least one alarm event includes an alarm event directly caused by the fault, and may also include an alarm event indirectly caused by the fault.
  • The alarm event directly caused by the fault is also called the root cause alarm event, and the root cause alarm event can be used to reflect the cause of the fault.
  • The abnormal events generated by the first network element may further include at least one performance limit violation event and/or at least one network element abnormal log event.
  • A performance limit violation event is an event that corresponds to a performance parameter and is generated when the value of that performance parameter of the first network element exceeds the parameter value threshold.
  • Here, the value of the performance parameter exceeding the parameter value threshold means that the value is either greater than the parameter value threshold or smaller than the parameter value threshold.
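  • Since a limit can be violated in either direction, a check needs both an upper and a lower bound. A minimal sketch, with the hypothetical function name `exceeds_limit` and the assumption that either bound may be absent:

```python
def exceeds_limit(value, max_limit=None, min_limit=None):
    """Return True if value is above max_limit or below min_limit.

    A bound of None means that direction is not monitored.
    """
    if max_limit is not None and value > max_limit:
        return True
    if min_limit is not None and value < min_limit:
        return True
    return False
```

  • A value equal to a bound is treated as within limits here, matching the "greater than" and "smaller than" wording above.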
  • For example, when a port in the first network element fails, the alarm event directly generated by the first network element is a port alarm event. The port alarm event is an alarm event directly caused by the fault, so it is the root cause alarm event. Because of the port failure, the link passing through the port may be unable to carry services, and the first network element may also generate a link failure alarm event.
  • The link failure alarm event is an alarm event indirectly caused by the fault.
  • A transmission rate limit violation event is a performance limit violation event.
  • The first network element may generate a log of the fault, and the log may include a port abnormal log. That is, the first network element may generate a port abnormal log event, which is a network element abnormal log event. Therefore, when a port in the first network element is faulty, the abnormal events generated by the first network element include a port alarm event, a link failure alarm event, a transmission rate limit violation event, and a port abnormal log event.
  • The first network element may generate one log or multiple logs. Each log generated by the first network element may correspond to one network element abnormal log event, so the first network element may generate one or more network element abnormal log events.
  • The first network element may also generate a link abnormal log, that is, it may also generate a link abnormal log event.
  • The first network element may generate one link abnormal log event that corresponds to multiple links, or it may generate multiple link abnormal log events, each corresponding to one link.
  • the other network elements may also be affected by the failure of the first network element, and at least one abnormal event may also be generated.
  • Other network elements that communicate with the first network element and are affected by the failure of the first network element may be referred to as second network elements.
  • After receiving the data sent by the first network element, the second network element detects that the faulty port in the first network element is blocked and generates a remote port blocking alarm event. Since the second network element cannot receive the data sent by the first network element through the link, the second network element may generate a log to record the situation, and that log may include a link abnormality log; that is, the second network element generates a link exception log event.
  • the second network element may generate one log, or may generate multiple logs.
  • Each log generated by the second network element may correspond to one network element abnormal log event, so the second network element may generate one or more network element abnormal log events.
  • When the fan of the first network element fails, the first network element generates a fan failure alarm event. The fan failure alarm event is an abnormal event directly caused by the fan failure and is the root cause alarm event.
  • the operating temperature of the board of the first network element rises.
  • When the operating temperature of the board exceeds the temperature threshold, the first network element generates an operating temperature limit violation event (a limit violation means going beyond a limit, either above the maximum limit or below the minimum limit).
  • the operating temperature limit violation event is an abnormal event indirectly caused by the fan failure.
  • the first network element may also generate a log of the fan failure, so that the first network element may generate a fan abnormal log event, and the fan abnormal log event is also an abnormal event indirectly caused by the fan failure.
  • the bit error rate of the data received by the second network element through the port becomes higher.
  • When the bit error rate exceeds the bit error rate threshold, the second network element generates a bit error rate limit violation event.
  • the second network element may also generate other abnormal events, such as alarm events and/or network element abnormal log events.
  • the second network element may also generate an abnormal bit error rate log, so that the second network element generates an abnormal bit error rate log event.
  • For each network element, after an abnormal event is generated, the network element sends event information of the abnormal event to the EMS that communicates with the network element.
  • the event information of the abnormal event includes the type information of the abnormal event, the generation time of the abnormal event, and the object information of the object where the abnormal event is located.
  • the type information of the abnormal event includes the event type to which the abnormal event belongs (such as an alarm type, a performance overrun type or a network element abnormal log type, etc.) and a subtype of the event type that the abnormal event belongs to.
  • the event information of the abnormal event may also include other information.
  • the event information of the network element abnormal log event may further include log content corresponding to the network element abnormal log event.
  • the event name of the abnormal event of the alarm type is the alarm event.
  • the alarm type includes at least one subtype.
  • the subtypes belonging to the alarm type include port alarm type, link failure alarm type, remote port blocking alarm type, and/or fan failure alarm type.
  • the abnormal events of the alarm type include events corresponding to the subtypes of the alarm type, such as port alarm events (corresponding to the port alarm type), link failure alarm events (corresponding to the link failure alarm type), remote port blocking alarm events (corresponding to the remote port blocking alarm type), and/or fan failure alarm events (corresponding to the fan failure alarm type).
  • the event name of the abnormal event of the performance limit violation type is the performance limit violation event.
  • the performance limit violation type includes at least one subtype, for example, the subtypes belonging to the performance limit violation type include the transmission rate limit violation type and/or the bit error rate limit violation type.
  • Performance limit violation events of the performance limit violation type include events corresponding to the subtypes of the performance limit violation type, such as transmission rate limit violation events (corresponding to the transmission rate limit violation type) and/or bit error rate limit violation events (corresponding to the bit error rate limit violation type).
  • the event name of the abnormal event of the NE exception log type is the NE exception log event.
  • the network element abnormality log type includes at least one subtype, for example, the subtypes belonging to the network element abnormality log type include link abnormality log type, bit error rate abnormality log type, and/or fan abnormality log type.
  • the abnormal events of the NE abnormal log type include events corresponding to the subtypes of the NE abnormal log type, such as link abnormal log events (corresponding to the link abnormal log type), bit error rate abnormal log events (corresponding to the bit error rate abnormal log type), and/or fan exception log events (corresponding to the fan exception log type).
  • the object where the abnormal event is located is the object that generated the abnormal event, and the object information of the object includes the object identifier of the object and/or the network element identifier of the network element where the object is located, and may also include description information of the object.
  • the object may be a network element, a component in the network element, or the endpoint, on the network element side, of a link established on the network element.
  • the object of the port alarm event is the port in the first network element, and the object information of the object includes the port identifier of the port and the network element identifier of the first network element.
  • the object information of the object may also include description information of the port, and the description information can be used to describe the position of the port in the first network element.
  • the description information includes the slot position of the port in the first network element.
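  • The event information described above can be pictured as a small record. The following is a minimal sketch and not the format used by any real EMS: the class names `ObjectInfo` and `EventInfo`, all field names, the string labels for types and subtypes, and the sample slot description are assumptions introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ObjectInfo:
    """Object where an abnormal event is located (hypothetical field names)."""
    object_id: str                     # e.g. a port identifier
    ne_id: str                         # identifier of the network element hosting the object
    description: Optional[str] = None  # optional, e.g. slot position of a port


@dataclass(frozen=True)
class EventInfo:
    """Event information that a network element sends to the EMS."""
    event_type: str      # "alarm", "performance_limit_violation" or "ne_abnormal_log"
    subtype: str         # e.g. "port_alarm"
    generated_at: float  # generation time, in seconds
    obj: ObjectInfo


# Sample record for the port alarm event on port P1 of the first network element.
port_alarm = EventInfo(
    event_type="alarm",
    subtype="port_alarm",
    generated_at=0.0,
    obj=ObjectInfo(object_id="Port1", ne_id="NE1", description="slot 3 (made-up value)"),
)
```

  • Keeping the object information in its own record mirrors the text: the same object (for example port P1) can then be shared by several abnormal events.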
  • the EMS receives event information of multiple abnormal events sent by at least one network element in communication with it, generates fault information based on the event information of the multiple abnormal events, and sends the fault information to the OSS, so that the OSS can generate fault processing rules based on the fault information.
  • The detailed process by which the EMS generates fault information and the OSS generates fault processing rules is described in subsequent embodiments.
  • an embodiment of the present application provides a method for handling faults, and the method can be applied to the network architecture shown in FIG. 1.
  • the EMS generates fault information, and the fault information includes a type information set, an object relationship data set, root cause alarm information, and processing suggestion information.
  • the OSS receives the fault information sent by the EMS, and generates fault processing rules based on the received fault information.
  • the method includes:
  • Step 201 The EMS receives event information of multiple abnormal events sent by at least one network element.
  • Each of the at least one network element communicates with the EMS. For example, there is a network connection between each of the network elements and the EMS.
  • the network element is referred to as the first network element for the convenience of description.
  • the first network element may generate at least one abnormal event
  • the second network element may also generate at least one abnormal event
  • the second network element is another network element other than the first network element in the at least one network element.
  • the first network element sends the event information of each abnormal event generated by the first network element to the EMS, and the second network element also sends the event information of each abnormal event generated by the second network element to the EMS.
  • the EMS receives event information of a plurality of abnormal events, where the plurality of abnormal events include at least one alarm event.
  • the event information of the alarm event includes the type information of the alarm event, the generation time of the alarm event and the object information of the object where the alarm event is located.
  • the type information of the alarm event includes the alarm type and the subtype of the alarm type to which the alarm event belongs. The object where the alarm event is located is the object that generated the alarm event, and the object information of the object includes the object identifier of the object and the network element identifier of the network element where the object is located, and may also include the description information of the object.
  • the event information of the alarm event may further include information such as an event identifier and/or an alarm level of the alarm event, and the event identifier may identify the alarm event in the EMS.
  • the alarm events generated by the first network element include a port alarm event and a link failure alarm event, and the alarm events generated by the second network element include a remote port blocking alarm event. Therefore, referring to Table 1 below, the event information of the alarm events received by the EMS includes the event information of the port alarm event, the event information of the link failure alarm event, and the event information of the remote port blocking alarm event.
  • the port P1 corresponding to the port identifier Port1 is the faulty port in the first network element.
  • the port P2 corresponding to the port identifier Port2 is a port in the second network element that is connected to the faulty port P1.
  • the first link endpoint is the endpoint of link 1 on the side of the first network element, and link 1 is the link established between the faulty port P1 of the first network element and the port P2 of the second network element.
  • the link identifier of link 1 is Link1
  • the endpoint identifier of the first link endpoint is Link1-1, and the first link endpoint is located in the faulty port P1.
  • the object where the port alarm event is located is port P1
  • the object where the link failure alarm event is located is the first link endpoint
  • the object where the remote port blocking alarm event is located is port P2.
  • the multiple abnormal events may also include at least one performance limit violation event, and for any performance limit violation event, the event information of the performance limit violation event includes the type information of the performance limit violation event, the generation time of the performance limit violation event, and the object information of the object where the performance limit violation event is located.
  • the type information of the performance limit violation event includes a performance limit violation type and a subtype of the performance limit violation type to which the performance limit violation event belongs.
  • the event information of the performance limit violation event may also include information such as the event identifier of the performance limit violation event and/or the parameter value of the performance parameter, and the event identifier can identify the performance limit violation event in the EMS.
  • the performance over-limit event generated by the first network element includes a transmission rate over-limit event. Therefore, referring to Table 2 below, the event information of the performance limit violation event received by the EMS includes the event information of the transmission rate limit violation event, and the object of the transmission rate limit violation event is also port P1.
  • the multiple abnormal events may also include at least one network element abnormal log event.
  • the event information of the network element abnormal log event includes the type information of the network element abnormal log event, the generation time of the network element abnormal log event, and the object information of the object where the network element abnormal log event is located.
  • the type information of the network element abnormality log event includes the network element abnormality log type and the subtype of the network element abnormality log type to which the network element abnormality log event belongs.
  • the event information of the network element abnormal log event may further include an event identifier and/or log content of the network element abnormal log event, and the event identifier can identify the network element abnormal log event in the EMS.
  • the event information of the network element abnormal log event received by the EMS includes the event information of the port abnormal log event and the event information of the link abnormal log event.
  • the object where the port abnormal log event is located is port P1, and the object where the link abnormal log event is located is the second link endpoint. The second link endpoint is the endpoint of link 1 on the second network element side, and it is located in port P2 of the second network element.
  • Step 202 The EMS determines the event information of at least one abnormal event caused by the same fault according to the event information of the multiple abnormal events and at least one fault processing rule in the EMS.
  • the EMS may start to perform this step when receiving the event information of the abnormal event, or perform this step periodically.
  • the event information of the multiple abnormal events used by the EMS in this step includes the event information of newly received abnormal events, and may also include the event information of abnormal events that were excluded when fault information was previously generated. For each of the multiple abnormal events, however, the time difference between its generation time and the current time is less than the specified time difference threshold.
  • the EMS stores a network topology map and at least one fault processing rule, and the at least one fault processing rule may include a fault processing rule received by the EMS and sent by the OSS and/or a fault processing rule configured in the EMS.
  • the fault handling rules sent by OSS are generated by OSS.
  • the EMS may determine the event information of at least one abnormal event caused by the same fault through the following operations from 2021 to 2025.
  • the 2021 to 2025 operations are:
  • 2021: The EMS selects one fault processing rule from the at least one fault processing rule as a target fault processing rule.
  • the target fault processing rule includes a set of type information, a set of first-type relational data, root cause alarm information, processing suggestion information and time length.
  • the type information set includes at least one subtype set, the at least one subtype set includes an alarm type set corresponding to the alarm type, and the alarm type set includes at least one subtype belonging to the alarm type.
  • the at least one subtype set may further include a performance limit violation type set corresponding to the performance limit violation type and/or a network element exception log type set corresponding to the network element exception log type.
  • the performance limit violation type set includes at least one subtype belonging to the performance limit violation type
  • the network element exception log type set includes at least one subtype belonging to the network element exception log type.
  • the first type relationship data set includes at least one piece of type relationship data, and each piece of type relationship data includes two types of information and a relationship between the two types of information.
  • the type relationship data specifically includes two subtypes and a relationship between the two subtypes, and the two subtypes may belong to the same event type, or belong to different event types.
  • <port alarm type, link failure alarm type, inclusion relationship> indicates that the port alarm type and the link failure alarm type both belong to the alarm type and that the relationship between these two subtypes is an inclusion relationship.
  • <port alarm type, transmission rate limit violation type, same network element relationship> indicates that the relationship between the port alarm type and the transmission rate limit violation type is the same network element relationship; the port alarm type and the transmission rate limit violation type belong to different event types.
  • the root cause alarm information includes type information of the root cause alarm event.
  • the root cause alarm information includes the subtype of the root cause alarm event.
  • the fault processing rules shown in Table 4 are stored in the EMS.
  • the type information set in the fault processing rule includes three sub-type sets, and the three sub-type sets are respectively an alarm type set, a performance over-limit type set, and a network element abnormal log type set.
  • the subtypes included in the alarm type set are port alarm type, link failure alarm type and remote port blocking alarm type.
  • the subtypes included in the set of performance overrun types are transmission rate overrun types.
  • the subtypes included in the NE exception log type set are port exception log type and link exception log type.
  • the type relationship data included in the first type relationship data set in the fault handling rule are <port alarm type, link failure alarm type, same network element relationship> and <port alarm type, transmission rate limit violation type, same network element relationship>.
  • the root cause alarm information in the fault handling rule includes the port alarm type, which is a subtype of the root cause alarm event, and the root cause alarm event is a port alarm event.
  • the processing recommendation information in this fault processing rule is the port where the root cause alarm event is located.
  • the length of time in this fault handling rule is 50 seconds.
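  • The fault processing rule described above can be written down as plain data. The sketch below is only one possible in-memory form: the class `FaultRule`, its field names, and the string labels are assumptions, the relationship labels follow this sketch's reading of the rule text, and the processing suggestion is left as a placeholder because its wording is not recoverable from the text.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Tuple

TypeRelation = Tuple[str, str, str]  # (subtype A, subtype B, relationship)


@dataclass(frozen=True)
class FaultRule:
    """Hypothetical in-memory form of one fault processing rule."""
    subtype_sets: Dict[str, FrozenSet[str]]  # event type -> subtype set
    type_relations: FrozenSet[TypeRelation]  # first type relationship data set
    root_cause_subtype: str                  # subtype of the root cause alarm event
    suggestion: str                          # processing suggestion information
    time_length_s: float                     # time length, in seconds

    def all_subtypes(self) -> FrozenSet[str]:
        """Flatten the type information set into a single set of subtypes."""
        return frozenset().union(*self.subtype_sets.values())


RULE = FaultRule(
    subtype_sets={
        "alarm": frozenset(
            {"port_alarm", "link_failure_alarm", "remote_port_blocking_alarm"}
        ),
        "performance_limit_violation": frozenset({"transmission_rate_limit_violation"}),
        "ne_abnormal_log": frozenset({"port_abnormal_log", "link_abnormal_log"}),
    },
    type_relations=frozenset(
        {
            ("port_alarm", "link_failure_alarm", "same_ne"),
            ("port_alarm", "transmission_rate_limit_violation", "same_ne"),
        }
    ),
    root_cause_subtype="port_alarm",
    suggestion="<processing suggestion placeholder>",
    time_length_s=50.0,
)
```

  • Grouping the subtypes by event type keeps the three subtype sets separate, while `all_subtypes()` gives the flat view that is convenient when matching received abnormal events against the rule.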
  • 2022: The EMS determines a first set according to the event information of each of the multiple abnormal events and the type information set included in the target fault processing rule. The first set includes the abnormal events corresponding to each piece of type information in the type information set.
  • the event information of each abnormal event in the plurality of abnormal events includes type information
  • the target fault handling rule includes a type information set.
  • the process of determining the first set may be: determining whether there is an abnormal event corresponding to each type of information included in the type information set in the plurality of abnormal events, and if so, acquiring the first set, the first set includes The abnormal event corresponding to each type of information in the type information set; if it does not exist, return to 2021 to reselect a fault processing rule as the target fault processing rule.
  • each subtype set includes at least one subtype, and the type information of each of the multiple abnormal events also includes a subtype. The process of determining the first set may therefore be: determine whether the multiple abnormal events contain an abnormal event corresponding to each subtype included in the type information set, and if so, obtain the first set, which includes the abnormal events corresponding to each subtype in the type information set.
  • Among the multiple abnormal events, there may be abnormal events that do not belong to the first set.
  • These abnormal events can be used the next time fault information is generated.
  • the EMS receives the six abnormal events shown in Table 1, Table 2, and Table 3 above: a port alarm event, a link failure alarm event, a remote port blocking alarm event, a transmission rate limit violation event, a port exception log event, and a link exception log event. The EMS stores the target fault processing rule shown in Table 4.
  • the type information set in the target fault processing rule includes six subtypes: port alarm type, link failure alarm type, remote port blocking alarm type, transmission rate limit violation type, port abnormality log type, and link abnormality log type.
  • the six abnormal events shown in Table 1, Table 2, and Table 3 include the port alarm event corresponding to the port alarm type, the link failure alarm event corresponding to the link failure alarm type, the remote port blocking alarm event corresponding to the remote port blocking alarm type, the transmission rate limit violation event corresponding to the transmission rate limit violation type, the port exception log event corresponding to the port exception log type, and the link exception log event corresponding to the link exception log type.
  • the first set thus obtained includes the port alarm event, the link failure alarm event, the remote port blocking alarm event, the transmission rate limit violation event, the port abnormality log event, and the link abnormality log event.
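  • The determination of the first set can be sketched as a coverage check: every subtype in the rule's type information set must be matched by at least one received abnormal event. This is a minimal illustration; the function name `first_set`, the dictionary representation of events, and the extra fan failure event in the sample data are assumptions.

```python
from typing import Dict, List, Optional, Set


def first_set(events: List[Dict[str, str]], rule_subtypes: Set[str]) -> Optional[List[Dict[str, str]]]:
    """Return the events whose subtype appears in the rule, or None if
    some subtype of the rule has no corresponding event."""
    matched = [e for e in events if e["subtype"] in rule_subtypes]
    present = {e["subtype"] for e in matched}
    if present != rule_subtypes:
        return None  # the caller would reselect another rule as the target rule
    return matched


RULE_SUBTYPES = {
    "port_alarm",
    "link_failure_alarm",
    "remote_port_blocking_alarm",
    "transmission_rate_limit_violation",
    "port_abnormal_log",
    "link_abnormal_log",
}

# Six events matching the rule plus one unrelated event (made up for the sketch).
EVENTS = [{"subtype": s} for s in sorted(RULE_SUBTYPES)] + [{"subtype": "fan_failure_alarm"}]

FIRST = first_set(EVENTS, RULE_SUBTYPES)
```

  • In this sample the unrelated fan failure event is left out of the first set and stays available for a later round of fault information generation.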
  • 2023: The EMS determines the second set according to the first set and the time length included in the target fault handling rule.
  • the second set is a subset of the first set, and the time span of the generation times of the abnormal events in the second set is less than or equal to the time length included in the target fault handling rule.
  • the time span is equal to the time difference between the generation time of the earliest generated abnormal event and the generation time of the latest generated abnormal event in the second set.
  • that the second set is a subset of the first set covers both the case where the second set is a proper subset of the first set and the case where the second set is the same as the first set.
  • In step 2023, the time span of the generation times of the abnormal events in the first set is obtained. If the time span is less than or equal to the time length included in the target fault handling rule, the first set is taken as the second set. If the time span is greater than that time length, the earliest or the latest abnormal event is removed from the first set, and the time span of the generation times of the remaining abnormal events is obtained. If that time span is less than or equal to the time length, the remaining abnormal events form the second set.
  • If that time span is still greater than the time length, the earliest or the latest abnormal event is again removed from the remaining abnormal events, and the process repeats until the time span of the generation times of the remaining abnormal events in the first set is less than or equal to the time length in the target fault handling rule.
  • the earliest abnormal event in the first set is the port alarm event, and the latest abnormal event is the link abnormal log event.
  • With T1 denoting the generation time of the port alarm event and T6 denoting the generation time of the link abnormal log event, the time span of the generation times of the six abnormal events in the first set is T6-T1.
  • Assuming that T6-T1 is less than or equal to the time length in the target fault handling rule, the first set is taken as the second set, so the second set also includes the six abnormal events. That is, as shown in Table 5, the second set includes the port alarm event, the link failure alarm event, the remote port blocking alarm event, the transmission rate limit violation event, the port exception log event, and the link exception log event.
  • Table 5. Abnormal events contained in the second set: port alarm event; link failure alarm event; remote port blocking alarm event; transmission rate limit violation event; port exception log event; link exception log event.
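  • The trimming of the first set into the second set can be sketched as a loop over time-sorted events. The text only says that the earliest or the latest abnormal event is removed; the choice made below (drop whichever end lies farther from its neighbour) is an assumption of this sketch, as are the function name `second_set`, the tuple representation of events, and the sample generation times.

```python
from typing import List, Tuple

Event = Tuple[str, float]  # (subtype, generation time in seconds)


def second_set(first: List[Event], time_length: float) -> List[Event]:
    """Shrink the first set until its time span fits within time_length."""
    events = sorted(first, key=lambda e: e[1])
    while len(events) > 1 and events[-1][1] - events[0][1] > time_length:
        # Assumed policy: remove the end that is farther from its neighbour.
        gap_front = events[1][1] - events[0][1]
        gap_back = events[-1][1] - events[-2][1]
        if gap_front >= gap_back:
            events.pop(0)  # remove the earliest abnormal event
        else:
            events.pop()   # remove the latest abnormal event
    return events


# Six events whose span (40 s) fits in a 50 s window: nothing is removed.
SIX = [("e1", 0.0), ("e2", 5.0), ("e3", 10.0), ("e4", 20.0), ("e5", 30.0), ("e6", 40.0)]
KEPT_ALL = second_set(SIX, 50.0)

# One stale event far from the rest: it is the one that gets removed.
STALE = [("old", 0.0), ("a", 100.0), ("b", 110.0), ("c", 120.0)]
KEPT_RECENT = second_set(STALE, 50.0)
```

  • After trimming, the caller still has to confirm that the second set contains an abnormal event for every piece of type information in the target rule; otherwise another rule is selected.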
  • 2024: When the second set includes an abnormal event corresponding to each piece of type information in the type information set in the target fault processing rule, the EMS obtains a second type relationship data set according to the network topology map and/or the event information of each abnormal event in the second set.
  • the second type relationship data set includes at least one piece of type relationship data. Each piece of type relationship data includes two pieces of type information and the relationship between them, and the two pieces of type information are the type information of two abnormal events in the second set.
  • If the second set does not include an abnormal event corresponding to one or more pieces of type information in the target fault processing rule, return to 2021 to reselect a fault processing rule as the target fault processing rule.
  • For any two abnormal events in the second set, referred to as the first abnormal event and the second abnormal event, the EMS determines whether at least one specified relationship exists between the first object and the second object according to the object identifier of the first object and the identifier of the network element where it is located, the object identifier of the second object and the identifier of the network element where it is located, and/or the network topology map.
  • the first object is the object where the first abnormal event is located, and the second object is the object where the second abnormal event is located.
  • If a relationship exists between the first object and the second object, that relationship is also the relationship between the type information of the first abnormal event and the type information of the second abnormal event.
  • a piece of type relationship data and a piece of object relationship data can be obtained, the type relationship data includes the type information of the first abnormal event, the type information of the second abnormal event and the relationship, and the object relationship data includes the object information of the first object, Object information of the second object and the relationship.
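  • the type relationship data specifically includes the subtype of the first abnormal event, the subtype of the second abnormal event, and the relationship
  • the object relationship data specifically includes the object identifier of the first object and the network element identifier of the network element where the first object is located, the object identifier of the second object and the network element identifier of the network element where the second object is located, and the relationship.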
  • the EMS selects a specified relationship with the highest or lowest priority from the at least one specified relationship as the relationship between the first object and the second object.
  • the specified relationship may be an inclusion relationship, a bearer relationship, a same-service relationship, a neighbor relationship, or a same-network element relationship, and the like.
  • the priorities of the inclusion relationship, the bearer relationship, the same service relationship, the neighbor relationship, and the same network element relationship are defined, and the priorities of each relationship are different.
  • an example of the priority order of the relationships is: inclusion relationship > bearer relationship > same service relationship > neighbor relationship > same network element relationship. The priority order of the relationships may also be other than the above example.
  • the so-called inclusion relationship means that the first object includes the second object or the second object includes the first object.
  • the first object is a board
  • the second object is a port in the board
  • the first object includes the second object.
  • the so-called bearer relationship means that the first object is located on the second object or the second object is located on the first object, for example, the first object is a physical link
  • the second object is a logical link
  • the second object is located on the first object.
  • the so-called same service relationship means that the first object and the second object are used to transmit the same service.
  • the so-called neighbor relationship refers to that the first object and the second object belong to two different network elements and the two network elements are directly connected.
  • the so-called same network element relationship means that the first object and the second object are located in the same network element.
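  • Choosing one relationship out of several by priority can be sketched with an ordered enumeration. The numeric values below simply encode the example ordering given in the text (inclusion highest, same network element lowest); the names `Relation` and `pick_relation` and the `prefer_highest` flag are assumptions of this sketch.

```python
from enum import IntEnum
from typing import Iterable, Optional


class Relation(IntEnum):
    """Specified relationships; a larger value means a higher priority."""
    SAME_NE = 1
    NEIGHBOR = 2
    SAME_SERVICE = 3
    BEARER = 4
    INCLUSION = 5


def pick_relation(found: Iterable[Relation], prefer_highest: bool = True) -> Optional[Relation]:
    """Select the highest (or lowest) priority relationship, or None if none exists."""
    candidates = list(found)
    if not candidates:
        return None
    return max(candidates) if prefer_highest else min(candidates)


# A port and a link endpoint inside it are related in two ways at once.
CHOSEN = pick_relation([Relation.SAME_NE, Relation.INCLUSION])
```

  • Because the text allows either the highest or the lowest priority to be selected, the direction is left as a parameter rather than fixed in the sketch.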
  • the two abnormal events are a port alarm event and a link failure alarm event. The object where the port alarm event is located is port P1, which is located on the first network element, and its object identifier is the port identifier Port1. The object where the link failure alarm event is located is the first link endpoint, which is also located on the first network element, and its object identifier is the endpoint identifier Link1-1.
  • the EMS determines whether at least one specified relationship exists between port P1 and the first link endpoint according to the object identifier Port1 of port P1 and the first network element identifier NE1, the object identifier Link1-1 of the first link endpoint and the first network element identifier NE1, and/or the network topology map.
  • port P1 is located on the first network element and the first link endpoint is located in port P1, so both an inclusion relationship and a same network element relationship exist between port P1 and the first link endpoint. The inclusion relationship, which has the highest priority, is selected as the relationship between port P1 and the first link endpoint.
  • a piece of type relationship data and a piece of object relationship data can be obtained.
  • the type relationship data is <port alarm type, link failure alarm type, inclusion relationship>
  • the object relationship data is <object information of port P1, object information of the first link endpoint, inclusion relationship>.
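  • Deriving one piece of type relationship data and one piece of object relationship data from two abnormal events can be sketched as follows. The text does not say how containment is read from the identifiers or the network topology map, so the lookup table `CONTAINS` is a hypothetical stand-in for that knowledge; the function name `relate`, the dictionary keys, and the restriction to objects in the same network element are likewise assumptions of this sketch.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical containment knowledge: (network element, container) -> contained objects.
CONTAINS: Dict[Tuple[str, str], Set[str]] = {("NE1", "Port1"): {"Link1-1"}}

TypeRel = Tuple[str, str, str]
ObjRel = Tuple[Tuple[str, str], Tuple[str, str], str]


def relate(ev1: Dict[str, str], ev2: Dict[str, str]) -> Optional[Tuple[TypeRel, ObjRel]]:
    """Build type and object relationship data for two events in one network element."""
    if ev1["ne_id"] != ev2["ne_id"]:
        return None  # objects in different network elements are outside this sketch
    ne = ev1["ne_id"]
    a, b = ev1["object_id"], ev2["object_id"]
    includes = b in CONTAINS.get((ne, a), set()) or a in CONTAINS.get((ne, b), set())
    # Inclusion outranks the same network element relationship in the example ordering.
    relation = "inclusion" if includes else "same_ne"
    type_rel = (ev1["subtype"], ev2["subtype"], relation)
    obj_rel = ((a, ne), (b, ne), relation)
    return type_rel, obj_rel


PORT_ALARM = {"subtype": "port_alarm", "object_id": "Port1", "ne_id": "NE1"}
LINK_ALARM = {"subtype": "link_failure_alarm", "object_id": "Link1-1", "ne_id": "NE1"}

RESULT = relate(PORT_ALARM, LINK_ALARM)
```

  • The same relationship label appears in both tuples, reflecting that the relationship between the two objects is also taken as the relationship between the two pieces of type information.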
  • the EMS determines whether there is at least one specified relationship between the first object and the second object by performing the following operations (1) to (4).
  • the operations (1) to (4) can be:
  • (1) The EMS obtains, from the event information of the first abnormal event, the type information of the first abnormal event, the object identifier of the first object, and the identifier of the network element where the first object is located, and obtains, from the event information of the second abnormal event, the type information of the second abnormal event, the object identifier of the second object, and the identifier of the network element where the second object is located.
  • the event information of the port alarm event includes the port alarm type, the port identifier Port1 of port P1 (port P1 is the object and Port1 is its object identifier), and the identifier NE1 of the network element where port P1 is located;
  • the event information of the link failure alarm event includes the link failure alarm type, the endpoint identifier Link1-1 of the first link endpoint (the first link endpoint is the object and Link1-1 is its object identifier), and the identifier NE1 of the network element where the first link endpoint is located.
  • the EMS obtains the port alarm type, the port identifier Port1 of port P1, and the network element identifier NE1 from the event information of the port alarm event, and obtains the link failure alarm type, the endpoint identifier Link1-1 of the first link endpoint, and the network element identifier NE1 from the event information of the link failure alarm event.
  • the event information of the link abnormality log event includes the link abnormality log type, the endpoint identifier Link1-2 of the second link endpoint (the second link endpoint is the object and Link1-2 is its object identifier), and the identifier NE2 of the network element where the second link endpoint is located. Therefore, the EMS obtains the link exception log type, the endpoint identifier Link1-2 of the second link endpoint, and the network element identifier NE2 from the event information of the link exception log event.
  • (2) The EMS determines, according to the identifier of the network element where the first object is located and the identifier of the network element where the second object is located, whether the first object and the second object are located in the same network element. If they are located in the same network element, the EMS performs operation (3) below; if they are located in different network elements, the EMS performs operation (4) below.
  • For example, according to the network element identifier NE1 of port P1 and the network element identifier NE1 of the first link endpoint, the EMS determines that port P1 and the first link endpoint are both located in the first network element, that is, in the same network element, and performs operation (3) below.
  • For another example, according to the network element identifier NE1 of port P1 and the network element identifier NE2 of the second link endpoint, the EMS determines that port P1 and the second link endpoint are located in different network elements, and performs operation (4) below.
  • (3) The EMS determines, according to the object identifier of the first object and the object identifier of the second object, at least one specified relationship that exists between the first object and the second object. The at least one specified relationship may include an inclusion relationship and/or a same-network-element relationship. The procedure then ends.
  • The EMS also selects, from the at least one specified relationship, the specified relationship with the highest or the lowest priority as the relationship between the first object and the second object.
  • For example, the object where the port alarm event is located is port P1, and the object where the link failure alarm event is located is the first link endpoint. According to the endpoint identifier Link1-1 of the first link endpoint and the port identifier Port1 of port P1, the EMS determines that port P1 includes the first link endpoint, so both an inclusion relationship and a same-network-element relationship exist between the first link endpoint and port P1. The priority of the inclusion relationship is higher than that of the same-network-element relationship, so the relationship between the first link endpoint and port P1 is determined to be the inclusion relationship.
  • In this way, a piece of object relationship data is obtained, namely <object information of port P1, object information of the first link endpoint, inclusion relationship>, and a piece of type relationship data is obtained, namely <port alarm type, link failure alarm type, inclusion relationship>.
  • (4) The EMS determines, according to the network topology diagram, the object identifier of the first object and the identifier of the network element where it is located, and the object identifier of the second object and the identifier of the network element where it is located, whether at least one specified relationship exists between the first object and the second object. The at least one specified relationship may include a neighbor relationship, a same-service relationship and/or a bearer relationship.
  • If at least one specified relationship exists, the EMS selects the specified relationship with the highest or the lowest priority from the at least one specified relationship as the relationship between the first object and the second object. If no specified relationship exists between the first object and the second object, there is no relationship between the first object and the second object, and no relationship between the type information of the first abnormal event and the type information of the second abnormal event.
  • For example, the object where the port alarm event is located is port P1, and the object where the link exception log event is located is the second link endpoint. Port P1 is located in the first network element, and the second link endpoint is located in port P2 of the second network element. The second link endpoint and the first link endpoint in port P1 are the two endpoints of the same link 1.
  • Therefore, according to the network topology diagram, the port identifier Port1 of port P1, the network element identifier NE1 of the first network element, the endpoint identifier Link1-2 of the second link endpoint, and the network element identifier NE2 of the second network element, the EMS determines that a neighbor relationship exists between port P1 and the second link endpoint.
  • The neighbor relationship is determined as the relationship between port P1 and the second link endpoint, so that a piece of object relationship data is obtained, namely <object information of port P1, object information of the second link endpoint, neighbor relationship>, and a piece of type relationship data is obtained, namely <port alarm type, link exception log type, neighbor relationship>.
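  • Operations (2) to (4) and the priority-based selection can be sketched as follows. This is only an illustrative sketch: the `CONTAINS` and `TOPOLOGY` lookup tables, the dictionary layout of an event, and the numeric values in `PRIORITY` are assumptions made for the example (the text only states that the inclusion relationship outranks the same-network-element relationship), and the sketch picks the highest priority although the text also allows the lowest.

```python
from typing import Optional

# Assumed priority values (higher wins). Only "inclusion > same_ne" is stated in the text.
PRIORITY = {"inclusion": 5, "same_ne": 4, "bearer": 3, "same_service": 2, "neighbor": 1}

# Hypothetical lookup data standing in for the EMS object model and topology diagram.
CONTAINS = {("NE1", "Port1"): {"Link1-1"}, ("NE2", "Port2"): {"Link1-2"}}
TOPOLOGY = {
    frozenset({("NE1", "Port1"), ("NE2", "Link1-2")}): {"neighbor"},
    frozenset({("NE1", "Link1-1"), ("NE2", "Link1-2")}): {"same_service"},
}


def candidate_relations(a: dict, b: dict) -> set:
    """Collect every specified relationship that exists between two objects."""
    if a["ne"] == b["ne"]:  # operation (2): same network element, go to (3)
        found = {"same_ne"}
        a_contains_b = b["obj"] in CONTAINS.get((a["ne"], a["obj"]), set())
        b_contains_a = a["obj"] in CONTAINS.get((b["ne"], b["obj"]), set())
        if a_contains_b or b_contains_a:
            found.add("inclusion")
        return found
    # operation (4): different network elements, consult the topology
    key = frozenset({(a["ne"], a["obj"]), (b["ne"], b["obj"])})
    return set(TOPOLOGY.get(key, set()))


def pick_relation(a: dict, b: dict) -> Optional[str]:
    """Return the highest-priority relationship, or None if none exists."""
    found = candidate_relations(a, b)
    return max(found, key=PRIORITY.__getitem__) if found else None


port_alarm = {"type": "port_alarm", "obj": "Port1", "ne": "NE1"}
link_failure = {"type": "link_failure_alarm", "obj": "Link1-1", "ne": "NE1"}
link_log = {"type": "link_exception_log", "obj": "Link1-2", "ne": "NE2"}
unrelated = {"type": "port_alarm", "obj": "Port9", "ne": "NE9"}
```

  • With this sample data, `pick_relation(port_alarm, link_failure)` yields the inclusion relationship and `pick_relation(port_alarm, link_log)` yields the neighbor relationship, mirroring the two examples above.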
  • For example, the second set includes the port alarm event, the link failure alarm event, the remote port blocking alarm event, the transmission rate violation event, the port exception log event, and the link exception log event. Processing each pair of these abnormal events as described above yields the second-type relational data set shown in Table 6 below and the object relational data set shown in Table 7 below.
  • Table 6: the second-type relational data set
  • <Port alarm type, link failure alarm type, inclusion relationship>
  • <Port alarm type, transmission rate violation type, same-network-element relationship>
  • <Link exception log type, remote port blocking alarm type, inclusion relationship>
  • <Port alarm type, remote port blocking alarm type, neighbor relationship>
  • <Link failure alarm type, remote port blocking alarm type, neighbor relationship>
  • <Transmission rate violation type, remote port blocking alarm type, neighbor relationship>
  • <Port exception log type, remote port blocking alarm type, neighbor relationship>
  • <Port alarm type, link exception log type, neighbor relationship>
  • <Link failure alarm type, link exception log type, same-service relationship>
  • <Transmission rate violation type, link exception log type, neighbor relationship>
  • <Port exception log type, link exception log type, neighbor relationship>
  • Table 7: the object relational data set
  • <Object information of port P1, object information of the first link endpoint, inclusion relationship>
  • <Object information of port P1, object information of port P2, neighbor relationship>
  • <Object information of port P1, object information of the second link endpoint, neighbor relationship>
  • <Object information of the first link endpoint, object information of port P2, neighbor relationship>
  • <Object information of the first link endpoint, object information of the second link endpoint, same-service relationship>
  • <Object information of port P2, object information of the second link endpoint, inclusion relationship>
  • The target fault processing rule includes a type information set and a first-type relational data set, and the abnormal events included in the second set are the abnormal events corresponding to each piece of type information in the type information set.
  • If the second-type relational data set includes the first-type relational data set, each piece of type relation data in the first-type relational data set can be matched with two abnormal events in the second set.
  • For example, the second-type relational data set shown in Table 6 includes the first-type relational data set in the target fault handling rule shown in Table 4.
  • For the type relation data <port alarm type, link failure alarm type, inclusion relationship> in the first-type relational data set, the second set includes the port alarm event corresponding to the port alarm type and the link failure alarm event corresponding to the link failure alarm type, and the second-type relational data set shown in Table 6 records that the relationship between the port alarm type of the port alarm event and the link failure alarm type of the link failure alarm event is also an inclusion relationship. Therefore, <port alarm type, link failure alarm type, inclusion relationship> matches the port alarm event and the link failure alarm event in the second set.
  • Similarly, every other piece of type relation data in the first-type relational data set matches abnormal events in the second set. It can therefore be determined that the event information of the port alarm event, the link failure alarm event, the remote port blocking alarm event, the transmission rate violation event, the port exception log event, and the link exception log event in the second set is event information of abnormal events caused by the same fault.
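  • The matching test described above amounts to checking that the first-type relational data set is contained in the second-type relational data set. A minimal sketch follows; the tuple representation and the decision to ignore the order of the two types inside a triple are assumptions of this example, not something the text prescribes.

```python
def normalise(triples):
    """Represent each <type A, type B, relation> triple with an unordered type pair.

    Treating <A, B, r> and <B, A, r> as equal is a simplifying assumption.
    """
    return {(frozenset((a, b)), rel) for a, b, rel in triples}


def rule_matches(first_type_relations, second_type_relations) -> bool:
    """True if every triple required by the rule was observed among the events."""
    return normalise(first_type_relations) <= normalise(second_type_relations)


# A few rows in the spirit of Table 6 (observed) and a hypothetical rule.
observed = [
    ("port_alarm", "link_failure_alarm", "inclusion"),
    ("port_alarm", "link_exception_log", "neighbor"),
    ("link_failure_alarm", "link_exception_log", "same_service"),
]
rule_relations = [
    ("link_failure_alarm", "port_alarm", "inclusion"),
    ("port_alarm", "link_exception_log", "neighbor"),
]
other_rule = rule_relations + [("port_alarm", "port_exception_log", "bearer")]
```

  • Here `rule_matches(rule_relations, observed)` holds, while `other_rule` fails because one of its triples was never observed.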
  • Step 203: The EMS sends fault information to the OSS, where the fault information includes the generation time and object information of each abnormal event in the at least one abnormal event, the object relationship data set, and the type information set, root cause alarm information, and processing suggestion information in the target fault processing rule.
  • The at least one abnormal event is the abnormal events in the second set obtained in step 202.
  • The target fault processing rule is the fault processing rule obtained in step 202 that matches the at least one abnormal event, and the object relationship data set is also the object relationship data set obtained in step 202.
  • The type information included in the type information set in the target fault handling rule is the type information of each abnormal event in the at least one abnormal event.
  • In step 203, the EMS combines the generation time of each abnormal event in the at least one abnormal event, the object information of each abnormal event, the object relationship data set, and the type information set, root cause alarm information, and processing suggestion information in the target fault processing rule into fault information, and sends the fault information to the OSS.
  • For example, the EMS determines that the event information of the port alarm event, the link failure alarm event, the remote port blocking alarm event, the transmission rate violation event, the port exception log event, and the link exception log event in the second set is event information of abnormal events caused by the same fault, and obtains the object relationship data set shown in Table 7.
  • The EMS combines the generation time T1 and object information of the port alarm event, the generation time T2 and object information of the link failure alarm event, the generation time T3 and object information of the remote port blocking alarm event, the generation time T4 and object information of the transmission rate violation event, the generation time T5 and object information of the port exception log event, the generation time T6 and object information of the link exception log event, the object relationship data set shown in Table 7, and the type information set, root cause alarm information, and processing suggestion information in the target fault processing rule shown in Table 4 into the fault information shown in Table 8 below.
  • The type information set included in the fault information, together with the generation time and object information of each abnormal event in the at least one abnormal event, constitutes the event information of each abnormal event. Therefore, the fault information essentially includes the event information of each abnormal event, the object relationship data set, the root cause alarm information, and the processing suggestion information.
  • For example, Table 8 includes the object information, generation time T1, and port alarm type of the port alarm event; the object information, generation time T2, and link failure alarm type of the link failure alarm event; the object information, generation time T3, and remote port blocking alarm type of the remote port blocking alarm event; the object information, generation time T4, and transmission rate violation type of the transmission rate violation event; the object information, generation time T5, and port exception log type of the port exception log event; and the object information, generation time T6, and link exception log type of the link exception log event.
  • Therefore, the fault information shown in Table 8 essentially includes the event information of the port alarm event, the link failure alarm event, the remote port blocking alarm event, the transmission rate violation event, the port exception log event, and the link exception log event.
  • The event information of an abnormal event may also include other information, in which case the fault information may also include this other information.
  • For example, the event information of the port alarm event includes not only the type information, generation time, and object information of the port alarm event, but also the event identifier Event1 and the alarm level L1, so the fault information shown in Table 8 may also include the event identifier Event1 and the alarm level L1.
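  • To make the composition of the fault information in step 203 concrete, the following sketch models it as a small data structure. The class names, field names, numeric timestamps, and the `location` extra are all hypothetical choices for illustration; the text does not define a wire format.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AbnormalEvent:
    type_info: str        # e.g. "port_alarm"
    generated_at: float   # hypothetical numeric timestamp standing in for T1..T6
    object_info: str      # e.g. "NE1/Port1"


@dataclass
class FaultInfo:
    events: list                  # (generation time, object information) per event
    object_relations: frozenset   # (object_a, object_b, relation) triples
    type_info_set: frozenset      # taken from the target fault processing rule
    root_cause: str
    suggestion: str
    extras: dict = field(default_factory=dict)  # optional items such as fault location


def build_fault_info(events, object_relations, rule, **extras) -> FaultInfo:
    """Assemble the fault information that the EMS sends to the OSS."""
    return FaultInfo(
        events=[(e.generated_at, e.object_info) for e in events],
        object_relations=frozenset(object_relations),
        type_info_set=frozenset(rule["type_info_set"]),
        root_cause=rule["root_cause"],
        suggestion=rule["suggestion"],
        extras=dict(extras),
    )


rule = {
    "type_info_set": {"port_alarm", "link_failure_alarm"},
    "root_cause": "port_alarm",
    "suggestion": "restart the port where the root cause alarm event is located",
}
events = [
    AbnormalEvent("port_alarm", 1.0, "NE1/Port1"),
    AbnormalEvent("link_failure_alarm", 2.0, "NE1/Link1-1"),
]
relations = {("NE1/Port1", "NE1/Link1-1", "inclusion")}
info = build_fault_info(events, relations, rule, location="site-A")
```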
  • the fault information further includes a second type of relational data set.
  • the EMS may also acquire one or more of the location information of the fault, the start time of the fault, and the end time of the fault.
  • the fault information may also include one or more of fault location information, fault start time, and fault end time.
  • the fault occurrence location information includes the location of the network element that generates the root cause alarm event.
  • the fault location information may be the geographic location of the network element. For example, it may be the latitude and longitude coordinates of the network element or the name of the location of the network element.
  • The fault start time may be the generation time of the root cause alarm event.
  • After the fault disappears, the network element notifies the EMS that the fault has disappeared, so the fault end time is the time when the EMS receives that notification.
  • The EMS can also handle the network element fault according to the processing suggestion information included in the target fault processing rule. For example, assuming that the target fault processing rule is the fault processing rule shown in Table 4, the EMS restarts port P1 in the first network element according to the processing suggestion information "restart the port where the root cause alarm event is located" in the target fault processing rule.
  • The EMS may determine event information of abnormal events caused by multiple faults, generate fault information corresponding to each fault according to the event information of the abnormal events caused by that fault, and send the fault information corresponding to each fault to the OSS.
  • Multiple EMSs may each send the fault information they generate to the OSS.
  • Step 204: The OSS receives multiple pieces of fault information and aggregates those that include the same type information set, root cause alarm information, and processing suggestion information into a fault information set.
  • The type information set, root cause alarm information, and processing suggestion information included in every piece of the received fault information may be the same, in which case the OSS aggregates the multiple pieces of fault information into one fault information set.
  • Alternatively, the type information set, root cause alarm information, and processing suggestion information included in the pieces of fault information may differ, in which case the OSS aggregates the multiple pieces of fault information into multiple fault information sets.
  • Each fault information set obtained by the OSS corresponds to one fault.
  • For example, the multiple pieces of fault information received by the OSS include the fault information shown in Table 8. From the multiple pieces of fault information, the OSS aggregates each piece of fault information that includes the type information set, root cause alarm information, and processing suggestion information shown in Table 8 into a fault information set. Therefore, the type information set in each piece of fault information in the fault information set includes the port alarm type, link failure alarm type, remote port blocking alarm type, transmission rate violation type, port exception log type, and link exception log type.
  • The root cause alarm information included in each piece of fault information in the fault information set is the port alarm type, and the processing suggestion information included in each piece of fault information in the fault information set is to restart the port where the root cause alarm event is located.
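  • The aggregation in step 204 is a grouping by the triple (type information set, root cause alarm information, processing suggestion information). A minimal sketch, assuming each piece of fault information is a dictionary with the hypothetical keys shown:

```python
from collections import defaultdict


def aggregate(fault_infos):
    """Group fault information whose rule-identifying fields are identical."""
    groups = defaultdict(list)
    for info in fault_infos:
        key = (frozenset(info["type_info_set"]), info["root_cause"], info["suggestion"])
        groups[key].append(info)
    return list(groups.values())


received = [
    {"ems": "EMS-1", "type_info_set": {"port_alarm", "link_failure_alarm"},
     "root_cause": "port_alarm", "suggestion": "restart port"},
    {"ems": "EMS-2", "type_info_set": {"link_failure_alarm", "port_alarm"},
     "root_cause": "port_alarm", "suggestion": "restart port"},
    {"ems": "EMS-2", "type_info_set": {"board_alarm"},
     "root_cause": "board_alarm", "suggestion": "replace board"},
]
fault_info_sets = aggregate(received)
```

  • Using a `frozenset` in the key makes the grouping independent of the order in which an EMS lists the type information.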
  • For any piece of fault information, the OSS draws an object topology diagram based on the object relationship data set included in the fault information, and displays the object topology diagram.
  • The OSS also displays the root cause alarm information, the processing suggestion information, and the generation time, object information, and type information of each abnormal event included in the fault information, for operation and maintenance personnel to view.
  • According to the object topology diagram, the root cause alarm information, the processing suggestion information, and the type information, generation time, and/or object information of each abnormal event, the OSS determines the root cause locating process for the fault, and/or determines the object that needs to be repaired and the network element where the object is located, so as to prompt the operation and maintenance personnel to repair the object.
  • Step 205 For each piece of fault information in the fault information set, the OSS obtains a second type of relational data set corresponding to the fault information based on the object relational data set included in the fault information and the object information of each abnormal event.
  • The fault information set is any one of the one or more fault information sets aggregated in step 204.
  • In step 205, for any two abnormal events included in the fault information, referred to for convenience as the first abnormal event and the second abnormal event: if the object relationship data set contains object relationship data that includes the object information of the first abnormal event and the object information of the second abnormal event, the relationship included in that object relationship data is used as the relationship between the type information of the first abnormal event and the type information of the second abnormal event. A piece of type relationship data is thereby obtained, which includes the type information of the first abnormal event, the type information of the second abnormal event, and the relationship.
  • For example, in the fault information shown in Table 8, suppose the two abnormal events are the port alarm event and the link failure alarm event. The object where the port alarm event is located is port P1, and the object where the link failure alarm event is located is the first link endpoint.
  • The object relationship data set shown in Table 7 contains object relationship data that includes the object information of port P1, the object information of the first link endpoint, and an inclusion relationship. The resulting type relationship data therefore includes the type information of the port alarm event, the type information of the link failure alarm event, and the inclusion relationship, and can be expressed as <port alarm type, link failure alarm type, inclusion relationship>. Repeating this process for every other pair of abnormal events in Table 8 finally yields the second-type relational data set shown in Table 6.
  • Step 205 is optional. For any piece of fault information in the fault information set, if the fault information already includes the second-type relational data set, step 205 does not need to be performed.
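  • The derivation in step 205 can be sketched as a pairwise lookup. The tuple layouts below are assumptions of this example, and object pairs are treated as unordered for simplicity.

```python
from itertools import combinations


def derive_type_relations(events, object_relations):
    """Turn object relationship data into type relationship data.

    events: list of (type_info, object_info) pairs.
    object_relations: iterable of (object_a, object_b, relation) triples.
    """
    lookup = {frozenset((a, b)): rel for a, b, rel in object_relations}
    result = set()
    for (type_a, obj_a), (type_b, obj_b) in combinations(events, 2):
        rel = lookup.get(frozenset((obj_a, obj_b)))
        if rel is not None:  # pairs with no object relationship produce nothing
            result.add((type_a, type_b, rel))
    return result


events = [
    ("port_alarm", "P1"),
    ("link_failure_alarm", "Link1-1"),
    ("link_exception_log", "Link1-2"),
]
object_relations = {
    ("P1", "Link1-1", "inclusion"),
    ("P1", "Link1-2", "neighbor"),
}
type_relations = derive_type_relations(events, object_relations)
```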
  • Step 206: The OSS obtains the time span corresponding to the fault information based on the generation time of each abnormal event included in the fault information, where the time span is the interval covered by the generation times of the abnormal events.
  • In step 206, the OSS reads the generation time of the latest abnormal event and the generation time of the earliest abnormal event from the fault information, and determines the time span corresponding to the fault information from these two generation times.
  • For example, for the fault information shown in Table 8, the OSS reads the generation time T6 of the latest event (the link exception log event) and the generation time T1 of the earliest event (the port alarm event), and determines from T6 and T1 that the time span corresponding to the fault information is T6-T1.
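  • In code, the time span is simply the latest generation time minus the earliest. The concrete timestamps below are made-up stand-ins for T1 to T6.

```python
from datetime import datetime, timedelta


def time_span(generation_times):
    """Latest generation time minus earliest generation time."""
    return max(generation_times) - min(generation_times)


t1 = datetime(2021, 1, 1, 12, 0, 0)  # hypothetical value for T1
times = [t1 + timedelta(seconds=s) for s in (0, 4, 9, 15, 22, 30)]  # T1..T6
span = time_span(times)
```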
  • Steps 205 and 206 have no fixed execution order. That is, the OSS may execute step 205 first and then step 206, may execute step 205 and step 206 at the same time, or may execute step 206 first and then step 205.
  • Step 207: The OSS obtains a third-type relational data set by intersecting the second-type relational data sets corresponding to the fault information in the fault information set, and selects the maximum time span from the time spans corresponding to the fault information in the fault information set.
  • In step 207, the OSS may perform the following operations 2071 to 2073:
  • 2071: The OSS divides the fault information set into multiple subsets, where the pieces of fault information in each subset correspond to the same second-type relational data set and the same time span.
  • That is, the OSS classifies the fault information in the fault information set by second-type relational data set and time span, and places the pieces of fault information whose second-type relational data set and time span are both the same into one category, which forms one subset.
  • 2072: The OSS selects, from the divided subsets, multiple subsets that satisfy a first specified condition, where the first specified condition is that the ratio of the total number of pieces of fault information included in the selected subsets to the total number of pieces of fault information included in the fault information set exceeds a specified ratio.
  • The selection in 2072 may proceed as follows: the OSS selects, from the divided subsets, the m subsets that include the most fault information, where m is an integer greater than or equal to 2. It calculates the ratio of the total number of pieces of fault information included in the m subsets to the total number of pieces of fault information included in the fault information set, and when the ratio exceeds the specified ratio, performs operation 2073 below.
  • When the ratio does not exceed the specified ratio, the OSS selects, from the remaining unselected subsets, the n subsets that include the most fault information, where n is an integer greater than or equal to 1. A total of m+n subsets are then selected. The OSS calculates the ratio of the total number of pieces of fault information included in the m+n subsets to the total number of pieces of fault information included in the fault information set, and when the ratio exceeds the specified ratio, performs operation 2073 below.
  • The specified ratio may be 0.9, 0.8, 0.7, or the like.
  • 2073: The OSS obtains the third-type relational data set by intersecting the second-type relational data sets corresponding to the fault information in the selected subsets, and selects the maximum time span from the time spans corresponding to the fault information in the selected subsets.
  • The maximum time span is the duration of the abnormal events caused by the fault corresponding to the fault information set.
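  • Operations 2071 to 2073 can be sketched as below. This is an illustrative reading, not a definitive implementation: each piece of fault information is reduced to a hypothetical `(type_relations, span)` pair, the default values of `m`, `n`, and `specified_ratio` are arbitrary, and the loop keeps adding `n` subsets until the ratio is exceeded or no subsets remain, which goes slightly beyond the single extra round the text describes. The input is assumed to be non-empty.

```python
from collections import defaultdict


def learn_from_set(fault_infos, m=2, n=1, specified_ratio=0.8):
    """fault_infos: non-empty list of (type_relations: frozenset, span) pairs."""
    # 2071: partition by identical (second-type relational data set, time span).
    subsets = defaultdict(list)
    for relations, span in fault_infos:
        subsets[(relations, span)].append((relations, span))
    ranked = sorted(subsets.values(), key=len, reverse=True)

    # 2072: take the m largest subsets, then n more at a time, until the
    # selected share of all fault information exceeds the specified ratio.
    take = min(m, len(ranked))
    while True:
        selected = [info for subset in ranked[:take] for info in subset]
        if len(selected) / len(fault_infos) > specified_ratio or take >= len(ranked):
            break
        take = min(take + n, len(ranked))

    # 2073: intersect the relational data sets and keep the largest time span.
    third = frozenset.intersection(*(relations for relations, _ in selected))
    return third, max(span for _, span in selected)


A = frozenset({("port_alarm", "link_failure_alarm", "inclusion"),
               ("port_alarm", "link_exception_log", "neighbor")})
B = frozenset({("port_alarm", "link_failure_alarm", "inclusion")})
C = frozenset({("x", "y", "neighbor")})
infos = [(A, 5)] * 6 + [(B, 7)] * 3 + [(C, 9)]
```

  • With this sample, the two largest subsets already cover 9 of 10 pieces of fault information, so the outlier `C` is ignored when the specified ratio is 0.8 and the intersection keeps only the triple shared by `A` and `B`.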
  • Step 208: The OSS sends a fault processing rule to at least one EMS, where the fault processing rule includes the third-type relational data set, the maximum time span, and the type information set, root cause alarm information, and processing suggestion information included in first fault information, and the first fault information is any piece of fault information in the fault information set.
  • In step 208, the OSS combines the third-type relational data set, the maximum time span, and the type information set, root cause alarm information, and processing suggestion information included in the first fault information into a fault processing rule, and sends the fault handling rule to at least one EMS that communicates with the OSS.
  • Each piece of type relation data in the third-type relational data set includes two pieces of type information. The OSS determines whether the type information included in the third-type relational data set covers all the type information in the type information set in the first fault information. If it does, the third-type relational data set, the maximum time span, and the type information set, root cause alarm information, and processing suggestion information included in the first fault information are combined into a fault processing rule. If it does not, the third-type relational data set, the maximum time span, and the fault information set are discarded.
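  • The coverage check and rule assembly in step 208 might look like the following sketch. The dictionary keys are hypothetical names chosen for this example.

```python
def build_rule(third_type_relations, max_span, first_fault_info):
    """Return a fault processing rule, or None if the rule must be discarded."""
    covered = {t for a, b, _ in third_type_relations for t in (a, b)}
    if not set(first_fault_info["type_info_set"]) <= covered:
        return None  # some type information is not covered by any triple
    return {
        "type_relations": frozenset(third_type_relations),
        "max_span": max_span,
        "type_info_set": frozenset(first_fault_info["type_info_set"]),
        "root_cause": first_fault_info["root_cause"],
        "suggestion": first_fault_info["suggestion"],
    }


third = {("port_alarm", "link_failure_alarm", "inclusion")}
complete = {"type_info_set": {"port_alarm", "link_failure_alarm"},
            "root_cause": "port_alarm", "suggestion": "restart port"}
incomplete = dict(complete, type_info_set={"port_alarm", "link_exception_log"})
```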
  • In step 204, multiple fault information sets may be obtained by aggregation. The OSS performs steps 205 to 208 above for each fault information set and may obtain one or more fault handling rules, which it then sends to at least one EMS that communicates with the OSS.
  • After the OSS sends the fault handling rule to an EMS that communicates with the OSS, the EMS receives the fault handling rule and saves it. Afterwards, the EMS can use the fault processing rule to match the event information of abnormal events caused by the same fault, and/or to handle the network element fault; that is, the EMS continues from step 201 above.
  • The EMS may save the fault handling rule as follows: if the EMS has already locally saved a fault processing rule that includes the same type information set, root cause alarm information, and processing suggestion information as the received fault handling rule, the locally saved fault processing rule is updated to the received fault handling rule.
  • If the EMS has not locally saved a fault processing rule that includes the type information set, the root cause alarm information, and the processing suggestion information, the received fault processing rule is saved directly.
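  • This save-or-update behaviour is an upsert keyed on (type information set, root cause alarm information, processing suggestion information). A minimal sketch, assuming the EMS keeps its rules in an in-memory dictionary named `store` (a hypothetical choice):

```python
def save_rule(store: dict, rule: dict) -> None:
    """Insert the rule, or overwrite the saved rule that has the same key fields."""
    key = (frozenset(rule["type_info_set"]), rule["root_cause"], rule["suggestion"])
    store[key] = rule


store = {}
old = {"type_info_set": {"port_alarm", "link_failure_alarm"},
       "root_cause": "port_alarm", "suggestion": "restart port", "max_span": 30}
new = dict(old, max_span=45)  # same key fields, updated time span
other = {"type_info_set": {"board_alarm"},
         "root_cause": "board_alarm", "suggestion": "replace board", "max_span": 10}
for r in (old, new, other):
    save_rule(store, r)
```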
  • In this embodiment of the present application, the EMS receives the event information of multiple abnormal events, uses the fault processing rules it holds to identify the event information of the abnormal events caused by the same fault, and obtains fault information based on that event information. The fault information includes the type information, generation time, and object information of each abnormal event, the object relationship data set, and the root cause alarm information and processing suggestion information in the fault processing rule. Because the fault information includes the type information and object information of each abnormal event and the object relationship data set, the OSS can derive the relationships between the type information of different abnormal events from them, generate fault processing rules, and send the fault processing rules to different EMSs.
  • Because different EMSs receive the fault processing rules, different EMSs can handle the corresponding faults. Because the OSS receives fault information sent by different EMSs, it can automatically generate fault handling rules from a large amount of fault information, which makes the fault handling rules learnable and updatable and improves both the accuracy and the efficiency of generating fault handling rules. Sending the generated fault handling rules from the OSS to different EMSs improves the efficiency and timeliness of configuring fault handling rules in the EMSs, and ensures that different EMSs can handle the faults corresponding to those rules.
  • An embodiment of the present application provides another method for handling faults, which can be applied to the network architecture shown in FIG. 1.
  • In this method, the EMS generates fault information that includes a type information set, a second-type relational data set, root cause alarm information, processing suggestion information, and a time span.
  • The OSS receives the fault information sent by the EMS and generates fault handling rules based on the received fault information.
  • The method includes:
  • Steps 301-302 are respectively the same as steps 201-202, and will not be described in detail here.
  • Through steps 301 to 302, the EMS can obtain the event information of at least one abnormal event caused by the same fault, the second-type relational data set, and the target fault processing rule matching the at least one abnormal event.
  • For example, suppose the EMS again receives the port alarm event, link failure alarm event, remote port blocking alarm event, transmission rate violation event, port exception log event, and link exception log event shown in Table 1, Table 2, and Table 3 above. After processing the six abnormal events in steps 301 to 302, the EMS concludes that the six abnormal events are caused by the same fault, and obtains the target fault handling rule shown in Table 4 that the six abnormal events match and the second-type relational data set shown in Table 6.
  • Step 303: The EMS sends fault information to the OSS, where the fault information includes a time span, the second-type relational data set, and the type information set, root cause alarm information, and processing suggestion information in the target fault processing rule, and the time span is the interval covered by the generation times of the abnormal events in the at least one abnormal event.
  • In step 303, the EMS obtains the time span based on the generation time of each abnormal event in the at least one abnormal event, combines the time span, the second-type relational data set, and the type information set, root cause alarm information, and processing suggestion information in the target fault processing rule into fault information, and sends the fault information to the OSS.
  • For example, based on the generation time T1 of the port alarm event, the generation time T2 of the link failure alarm event, the generation time T3 of the remote port blocking alarm event, the generation time T4 of the transmission rate violation event, the generation time T5 of the port exception log event, and the generation time T6 of the link exception log event, the EMS obtains the time span T6-T1.
  • The time span T6-T1, the second-type relational data set shown in Table 6, and the type information set, root cause alarm information, and processing suggestion information in the target fault processing rule shown in Table 4 are combined into the fault information shown in Table 9 below.
  • Because the type information set included in the fault information has a smaller data volume than the event information of the at least one abnormal event, the fault information sent in this step is small, which reduces the occupation of network resources.
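  • The composition of fault information in step 303 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the dictionary keys, and the sample event types and times are all hypothetical, and the time span is assumed to be the difference between the latest and earliest generation times.

```python
from datetime import datetime, timedelta

def time_span(generation_times):
    """Return the span between the earliest and latest generation time."""
    return max(generation_times) - min(generation_times)

def build_fault_info(events, type_relations, target_rule):
    """Compose fault information from correlated abnormal events.

    events: list of (type_info, generation_time) tuples (hypothetical layout).
    type_relations: iterable of (type_a, type_b, relation) tuples.
    target_rule: dict holding the matched rule's fields (hypothetical keys).
    """
    return {
        "time_span": time_span([t for _, t in events]),
        "type_relations": frozenset(type_relations),
        "type_info_set": frozenset(target_rule["type_info_set"]),
        "root_cause_alarm": target_rule["root_cause_alarm"],
        "suggestion": target_rule["suggestion"],
    }

# Example with made-up event types and times.
t1 = datetime(2021, 1, 1, 12, 0, 0)
events = [
    ("port_alarm", t1),
    ("link_failure_alarm", t1 + timedelta(seconds=5)),
    ("link_exception_log", t1 + timedelta(seconds=30)),
]
rule = {
    "type_info_set": {"port_alarm", "link_failure_alarm", "link_exception_log"},
    "root_cause_alarm": "port_alarm",
    "suggestion": "check the port",
}
fault_info = build_fault_info(
    events, [("port_alarm", "link_failure_alarm", "connected")], rule
)
print(fault_info["time_span"])
```

  • Note that the per-event information is not copied into the result; only the type information set and the derived time span are kept, which is what keeps the message small.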
  • the EMS may determine event information of abnormal events caused by multiple faults, generate fault information corresponding to each fault according to the event information of abnormal events caused by each fault, and send the fault information corresponding to each fault to the OSS.
  • Different EMSs may each send the fault information they generate to the OSS.
  • Step 304: The OSS receives multiple pieces of fault information and, among them, aggregates the pieces that include the same type information set, root cause alarm information, and processing suggestion information into a first fault information set.
  • For the detailed implementation process by which the OSS aggregates the first fault information set, refer to the relevant content of step 204 in the embodiment shown in FIG. 2; details are not repeated here.
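  • The aggregation in step 304 amounts to grouping fault information by a composite key. The sketch below assumes each piece of fault information is a dictionary with hypothetical keys `type_info_set`, `root_cause_alarm`, and `suggestion`; it illustrates the grouping idea only and is not taken from step 204.

```python
from collections import defaultdict

def aggregate(fault_infos):
    """Group fault information sharing the same type information set,
    root cause alarm information, and processing suggestion information."""
    groups = defaultdict(list)
    for info in fault_infos:
        key = (
            frozenset(info["type_info_set"]),  # order-insensitive comparison
            info["root_cause_alarm"],
            info["suggestion"],
        )
        groups[key].append(info)
    return list(groups.values())

# Made-up sample data: the first two pieces share the same key.
infos = [
    {"type_info_set": ["a", "b"], "root_cause_alarm": "a", "suggestion": "s1", "ems": 1},
    {"type_info_set": ["b", "a"], "root_cause_alarm": "a", "suggestion": "s1", "ems": 2},
    {"type_info_set": ["c"], "root_cause_alarm": "c", "suggestion": "s2", "ems": 1},
]
fault_info_sets = aggregate(infos)
```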
  • Step 305: The OSS obtains a third-type relational data set by intersecting the second-type relational data sets included in the fault information in the first fault information set, and selects the maximum time span from the time spans included in that fault information.
  • In step 305, the OSS may perform the following operations 3051 to 3053:
  • 3051: For any identical pieces of fault information in the first fault information set, the OSS counts the number of those pieces, and deduplicates the first fault information set to obtain a second fault information set, so that every piece of fault information retained in the second fault information set is different.
  • Each piece of fault information in the second fault information set obtained after deduplication therefore corresponds to a number of pieces.
  • 3052: The OSS selects, from the second fault information set, multiple pieces of fault information that satisfy a second specified condition. The second specified condition is that the ratio of the accumulated number of pieces corresponding to the selected pieces of fault information to the total accumulated number of pieces corresponding to all pieces of fault information in the second fault information set exceeds a specified ratio.
  • The selection in 3052 may proceed as follows: the OSS selects, from the second fault information set, the m pieces of fault information with the largest corresponding numbers of pieces, where m is an integer greater than or equal to 2, and calculates the ratio of the accumulated number of pieces corresponding to the m pieces to the total accumulated number of pieces corresponding to all pieces of fault information in the second fault information set. If the ratio exceeds the specified ratio, the OSS performs operation 3053 below.
  • If the ratio does not exceed the specified ratio, the OSS selects, from the unselected fault information remaining in the second fault information set, the n pieces with the largest corresponding numbers of pieces, where n is an integer greater than or equal to 1, so that m+n pieces are selected in total. The OSS calculates the ratio of the accumulated number of pieces corresponding to the m+n pieces to the total accumulated number, and performs operation 3053 below when the ratio exceeds the specified ratio.
  • If the ratio still does not exceed the specified ratio, the OSS selects a further n pieces in the same way, so that m+2n pieces are selected in total, and repeats the ratio calculation and selection until the selected pieces of fault information satisfy the second specified condition.
  • 3053: The OSS obtains the third-type relational data set by intersecting the second-type relational data sets included in the selected pieces of fault information, and selects the maximum time span from the time spans included in the selected pieces.
  • The maximum time span is the duration of the abnormal events caused by the fault corresponding to the fault information set.
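  • Operations 3051 to 3053 can be sketched as follows. This is a hedged illustration under several assumptions: each piece of fault information is reduced to a hashable `(relations, time_span)` tuple, time spans are plain numbers, the function names and default values of `m`, `n`, and `specified_ratio` are hypothetical, and the loop stops once every distinct piece has been selected (a corner case the text does not spell out).

```python
from collections import Counter

def select_dominant(fault_infos, m=2, n=1, specified_ratio=0.8):
    """3051-3052: deduplicate with counts, then grow the selection
    (m, m+n, m+2n, ...) until its share of all pieces exceeds the ratio."""
    if not fault_infos:
        raise ValueError("need at least one piece of fault information")
    counts = Counter(fault_infos)       # 3051: identical pieces are counted once
    ranked = counts.most_common()       # largest counts first
    total = sum(counts.values())
    k = min(m, len(ranked))
    while k < len(ranked):
        covered = sum(count for _, count in ranked[:k])
        if covered / total > specified_ratio:
            break
        k = min(k + n, len(ranked))     # select n more pieces
    return [info for info, _ in ranked[:k]]

def derive_rule_parts(selected):
    """3053: intersect the relation sets and take the maximum time span."""
    third_relations = frozenset.intersection(*(rels for rels, _ in selected))
    max_span = max(span for _, span in selected)
    return third_relations, max_span

# Made-up data: relation labels r1..r4 and spans in seconds.
A = (frozenset({"r1", "r2"}), 30)
B = (frozenset({"r1", "r2", "r3"}), 40)
C = (frozenset({"r4"}), 10)
selected = select_dominant([A] * 5 + [B] * 4 + [C])
third_relations, max_span = derive_rule_parts(selected)
```

  • Selecting only the most frequent pieces before intersecting keeps a rare, possibly noisy piece of fault information (C above) from emptying the intersection.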
  • Step 306: The OSS sends a fault processing rule to at least one EMS, where the fault processing rule includes the third-type relational data set, the maximum time span, and the type information set, root cause alarm information, and processing suggestion information included in first fault information, and the first fault information is any piece of fault information in the second fault information set.
  • In this step, the OSS composes the third-type relational data set, the maximum time span, and the type information set, root cause alarm information, and processing suggestion information included in the first fault information into a fault processing rule, and sends the fault processing rule to at least one EMS that communicates with the OSS.
  • Each piece of type relationship data in the third-type relational data set includes two pieces of type information. The OSS determines whether the type information appearing in the third-type relational data set covers all type information in the type information set of the first fault information. If it does, the OSS composes the third-type relational data set, the maximum time span, and the type information set, root cause alarm information, and processing suggestion information included in the first fault information into a fault processing rule. If it does not, the OSS discards the third-type relational data set, the maximum time span, and the fault information set.
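  • The coverage check described above can be sketched as a subset test. The tuple layout `(type_a, type_b, relation)`, the dictionary keys, and the function name are assumptions made for illustration; returning `None` stands in for discarding the candidate rule.

```python
def build_rule(third_relations, max_span, first_fault_info):
    """Return a fault processing rule, or None when the relational data
    does not mention every type in the type information set."""
    covered = set()
    for type_a, type_b, _relation in third_relations:
        covered.update((type_a, type_b))
    if not set(first_fault_info["type_info_set"]) <= covered:
        return None  # discard: some type information is not covered
    return {
        "type_relations": frozenset(third_relations),
        "time_length": max_span,
        "type_info_set": frozenset(first_fault_info["type_info_set"]),
        "root_cause_alarm": first_fault_info["root_cause_alarm"],
        "suggestion": first_fault_info["suggestion"],
    }

# Made-up sample data.
info = {
    "type_info_set": {"port_alarm", "link_failure_alarm"},
    "root_cause_alarm": "port_alarm",
    "suggestion": "check the port",
}
complete = build_rule({("port_alarm", "link_failure_alarm", "connected")}, 40, info)
incomplete = build_rule({("port_alarm", "port_alarm", "same_object")}, 40, info)
```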
  • In step 304, multiple fault information sets may be aggregated. The OSS performs step 305 on each fault information set, may obtain one or more fault handling rules, and then sends the one or more fault handling rules to at least one EMS that communicates with the OSS.
  • After the OSS sends a fault handling rule to an EMS that communicates with the OSS, the EMS receives and saves the fault handling rule. The EMS can then use the fault handling rule to match the event information of abnormal events caused by the same fault and/or to handle the fault of the network element; that is, the EMS continues from step 301 above.
  • The EMS may save the fault handling rule as follows: if the EMS has already locally saved a fault processing rule that includes the same type information set, root cause alarm information, and processing suggestion information as the received fault handling rule, the EMS updates the locally saved fault processing rule to the received one.
  • If the EMS has not locally saved a fault processing rule that includes the type information set, root cause alarm information, and processing suggestion information, the EMS saves the received fault processing rule directly.
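  • The save-or-update behaviour on the EMS side can be sketched with a dictionary keyed by the three identifying fields. The store layout, key names, and function name are hypothetical; the sketch only shows that a rule with a matching key replaces the stored one, while a new key is added.

```python
def save_rule(store, rule):
    """Save a received rule; return True if an existing rule was updated."""
    key = (
        frozenset(rule["type_info_set"]),
        rule["root_cause_alarm"],
        rule["suggestion"],
    )
    updated = key in store
    store[key] = rule  # overwrite on match, insert otherwise
    return updated

# Made-up sample data: the second rule refines the first one.
local_rules = {}
old = {"type_info_set": {"a", "b"}, "root_cause_alarm": "a",
       "suggestion": "s1", "time_length": 30}
new = dict(old, time_length=45)
first_result = save_rule(local_rules, old)
second_result = save_rule(local_rules, new)
```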
  • The EMS receives the event information of multiple abnormal events, uses the fault processing rules it holds to identify the event information of the abnormal events caused by the same fault, and generates fault information according to that event information. The fault information includes a type information set, a second-type relational data set, root cause alarm information, and processing suggestion information. Because the fault information includes the second-type relational data set, the OSS can generate a fault handling rule from the second-type relational data sets in the fault information and send the rule to different EMSs. Because different EMSs receive the fault processing rule, different EMSs can handle the fault.
  • Because the OSS receives fault information sent by different EMSs, it can automatically generate fault handling rules from a large amount of fault information, which makes the fault handling rules learnable and updatable and improves both the accuracy and the efficiency of generating fault handling rules.
  • The OSS sends the generated fault handling rules to different EMSs, which improves the efficiency and real-time performance of configuring fault handling rules in the EMSs and ensures that different EMSs can handle the corresponding fault based on the fault handling rules.
  • An embodiment of the present application provides an apparatus 400 for processing faults.
  • The apparatus 400 may be deployed in the OSS provided by the embodiment shown in FIG. 1, FIG. 2, or FIG. 3, and includes:
  • a receiving unit 401, configured to receive fault information sent by at least one network element management system (EMS), where the fault information includes a type information set, a first relationship set, root cause alarm information of a fault, processing suggestion information for the fault, and time information of multiple abnormal events caused by the fault, the type information set includes the type information of the multiple abnormal events, and the first relationship set includes the relationships between the multiple abnormal events;
  • a processing unit 402, configured to aggregate fault information that includes the same type information set, root cause alarm information, and processing suggestion information into a fault information set, take the intersection of the first relationship sets in the pieces of fault information in the fault information set to obtain a second relationship set, and determine a time length according to the time information in each piece of fault information, where the time length is the duration of the abnormal events caused by the fault; and
  • a sending unit 403, configured to send a fault processing rule to at least one EMS, where the fault processing rule includes the second relationship set, the time length, and the type information set, root cause alarm information, and processing suggestion information in first fault information, the first fault information is any piece of fault information in the fault information set, and the fault processing rule is used to instruct the at least one EMS to handle the fault.
  • The relationships between the multiple abnormal events are represented by at least one piece of type relationship data. Each piece of type relationship data includes the type information of two of the multiple abnormal events and the relationship between the type information of the two abnormal events, and that relationship exists between the objects where the two abnormal events are located.
  • The fault information further includes object information of the object where each of the multiple abnormal events is located;
  • the relationships between the multiple abnormal events are represented by at least one piece of object relationship data, each piece of object relationship data includes the object information of two related objects and the relationship between them, and the objects where the abnormal events are located include the two objects;
  • the processing unit 402 is further configured to:
  • obtain, according to the at least one piece of object relationship data and the object information and type information of each abnormal event, a type relationship data set corresponding to the fault information, where the type relationship data set includes at least one piece of type relationship data, each piece of type relationship data includes the type information of two of the multiple abnormal events and the relationship between the type information of the two abnormal events, and that relationship exists between the objects where the two abnormal events are located; and
  • obtain the second relationship set by taking the intersection of the type relationship data sets corresponding to the pieces of fault information in the fault information set.
  • For the detailed implementation process by which the processing unit 402 acquires the type relationship data set corresponding to the fault information, refer to the relevant content of step 205 in the embodiment shown in FIG. 2; details are not repeated here.
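  • One way to derive type relationship data from object relationship data is sketched below. The details of step 205 are not reproduced in this passage, so the sketch is an assumption: for every related object pair, each event type located on the first object is paired with each event type located on the second object. The tuple layouts, the function name, and the sample labels are hypothetical.

```python
def to_type_relations(object_relations, events):
    """Map object relationship data to type relationship data.

    object_relations: iterable of (object_a, object_b, relation) tuples.
    events: iterable of (type_info, object_info) tuples.
    """
    types_by_object = {}
    for type_info, object_info in events:
        types_by_object.setdefault(object_info, set()).add(type_info)
    type_relations = set()
    for object_a, object_b, relation in object_relations:
        for type_a in types_by_object.get(object_a, ()):
            for type_b in types_by_object.get(object_b, ()):
                type_relations.add((type_a, type_b, relation))
    return type_relations

# Made-up sample data: two events on a port, one on the link it belongs to.
object_relations = [("portA", "link1", "belongs_to")]
events = [
    ("port_alarm", "portA"),
    ("port_exception_log", "portA"),
    ("link_failure_alarm", "link1"),
]
type_relations = to_type_relations(object_relations, events)
```

  • Because the result no longer mentions concrete objects, type relationship data sets derived from different network elements can be intersected directly.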
  • the time information of the multiple abnormal events is the time span of the generation time of the multiple abnormal events
  • the processing unit 402 is configured to select the maximum time span as the time length from the time spans in each piece of fault information.
  • the time information of the multiple abnormal events is the generation time of each abnormal event in the multiple abnormal events
  • the processing unit 402 is configured to obtain the time span of the occurrence time of the abnormal event corresponding to each piece of fault information according to the time information included in each piece of fault information; and select the maximum time span as the time length from the obtained time spans.
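  • Both forms of time information lead to the same result: the time length is the largest time span over the pieces of fault information. The sketch below assumes hypothetical dictionary keys `time_span` and `generation_times` and uses plain numbers for times.

```python
def time_length(fault_infos):
    """Return the maximum time span across pieces of fault information.

    A piece carries either a precomputed "time_span" or the raw
    "generation_times" of its abnormal events (hypothetical keys).
    """
    spans = []
    for info in fault_infos:
        if "time_span" in info:
            spans.append(info["time_span"])
        else:
            times = info["generation_times"]
            spans.append(max(times) - min(times))
    return max(spans)

# Made-up sample data, times in seconds.
length = time_length([
    {"time_span": 30},
    {"generation_times": [100, 145, 120]},
])
```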
  • the types of the multiple abnormal events include alarm types, performance overrun types, and/or network element abnormal log types.
  • processing unit 402 is further configured to:
  • An object topology map is generated based on the at least one piece of object relationship data, and the object topology map is displayed.
  • The processing unit can generate a fault processing rule based on the first relationship sets and the time information in the fault information. Because the receiving unit can receive fault information sent by different EMSs, the processing unit generates the fault processing rule based on the fault information of each EMS, which enriches the data available for generating fault processing rules, makes the fault processing rules learnable and updatable, and improves the efficiency of generating them. The sending unit sends the fault processing rule to at least one EMS, which improves the efficiency and real-time performance of configuring fault processing rules in the EMS.
  • An embodiment of the present application provides an apparatus 500 for processing faults.
  • The apparatus 500 may be deployed in the EMS provided by the embodiment shown in FIG. 1, FIG. 2, or FIG. 3, and includes:
  • a processing unit 501, configured to acquire multiple pieces of first abnormal event information, caused by a first fault, that are reported by at least one network element managed by the apparatus 500, and acquire root cause alarm information and processing suggestion information of the first fault, where each piece of first abnormal event information includes the type information, the generation time, and the object information of the object where the first abnormal event is located;
  • the processing unit 501 is further configured to obtain a first relationship set based on the object information of each first abnormal event, where the first relationship set includes the relationships between the multiple first abnormal events; and
  • a sending unit 502, configured to send first fault information to the operation support system (OSS), where the first fault information includes a type information set, first time information, the first relationship set, the root cause alarm information, and the processing suggestion information, the type information set includes the type information of each first abnormal event, the first time information includes the generation times of the multiple first abnormal events or the time span of those generation times, the first fault information is used by the OSS to generate a first fault processing rule, and the first fault processing rule is used to instruct at least one network element management system (EMS) that receives the first fault processing rule to handle the first fault.
  • The first fault processing rule includes a second relationship set, a first time length, and the type information set, root cause alarm information, and processing suggestion information in any piece of first fault information in a first fault information set. The first fault information set includes multiple pieces of first fault information received by the OSS, each of which includes the same type information set, root cause alarm information, and processing suggestion information. The second relationship set is the intersection of the first relationship sets in the multiple pieces of first fault information received by the OSS. The first time length is obtained by the OSS based on the first time information in the multiple pieces of first fault information, and is the duration of the first abnormal events caused by the first fault.
  • For the detailed implementation process by which the processing unit 501 acquires the information of the multiple first abnormal events caused by the first fault, refer to the relevant content of step 202 in the embodiment shown in FIG. 2; details are not repeated here.
  • For the detailed implementation process by which the processing unit 501 acquires the first relationship set, refer to the relevant content of step 2024 in the embodiment shown in FIG. 2; details are not repeated here.
  • The relationships between the multiple first abnormal events are represented by at least one piece of type relationship data, and each piece of type relationship data includes the type information of two of the multiple first abnormal events and the relationship between the type information of the two first abnormal events;
  • the processing unit 501 is configured to acquire the first relationship set based on the network topology map and/or the object information and type information of each first abnormal event.
  • For the detailed implementation process by which the processing unit 501 acquires the at least one piece of type relationship data, refer to the relevant content of step 2024 in the embodiment shown in FIG. 2; details are not repeated here.
  • The relationships between the multiple first abnormal events are represented by at least one piece of object relationship data, each piece of object relationship data includes the object information of two related objects and the relationship between them, the objects where the multiple first abnormal events are located include the two objects, and the apparatus 500 includes a network topology map;
  • the processing unit 501 is configured to obtain the first relationship set based on the network topology map and/or the object information of each first abnormal event.
  • For the detailed implementation process by which the processing unit 501 acquires the at least one piece of object relationship data, refer to the relevant content of step 2024 in the embodiment shown in FIG. 2; details are not repeated here.
  • the first fault information further includes object information of each first abnormal event.
  • the types of the plurality of first abnormal events include an alarm type, a performance overrun type, and/or a network element abnormal log type.
  • The apparatus 500 further includes a receiving unit 503;
  • the receiving unit 503 is configured to receive the first fault processing rule;
  • the processing unit 501 is further configured to: obtain, based on the first fault processing rule, multiple pieces of second abnormal event information, caused by a second fault, that are reported by at least one network element, where the first fault and the second fault are of the same type and each piece of second abnormal event information includes the type information, the generation time, and the object information of the object where the second abnormal event is located; and obtain a third relationship set based on the object information of each second abnormal event, where the third relationship set includes the relationships between the multiple second abnormal events;
  • the sending unit 502 is further configured to send second fault information to the OSS, where the second fault information includes second time information, the third relationship set, the type information set, the root cause alarm information, and the processing suggestion information, and the second time information includes the generation times of the multiple second abnormal events or the time span of those generation times. The second fault information is used to trigger the OSS to generate a second fault processing rule, which includes a fourth relationship set, a second time length, the type information set, the root cause alarm information, and the processing suggestion information. The fourth relationship set is the intersection of the third relationship sets in multiple pieces of second fault information received by the OSS, each of which includes the type information set, the root cause alarm information, and the processing suggestion information. The second time length is obtained by the OSS based on the second time information in the multiple pieces of second fault information, and is the duration of the second abnormal events caused by the second fault;
  • the receiving unit 503 is further configured to receive the second fault processing rule; and
  • the processing unit 501 is further configured to update the first fault processing rule to the second fault processing rule.
  • Because the processing unit obtains the first relationship set based on the object information of each first abnormal event, and the first relationship set includes the relationships between the multiple first abnormal events, the first fault information that the sending unit sends to the OSS includes the first relationship set and the time information of the multiple abnormal events.
  • The OSS can therefore obtain the second relationship set by taking the intersection of the first relationship sets included in the pieces of first fault information in the first fault information set, and determine the first time length based on the time information included in those pieces, so that a fault handling rule including the second relationship set and the first time length can be obtained.
  • Because the OSS can receive first fault information sent by different EMSs and generate fault handling rules based on the first fault information of each EMS, the data available for generating fault handling rules is enriched, the generated fault handling rules are learnable and updatable, and the efficiency of generating fault handling rules is improved.
  • The OSS sends the fault processing rules to the EMS, which improves the efficiency and real-time performance of configuring fault processing rules in the EMS.
  • an embodiment of the present application provides a schematic diagram of an apparatus 600 for processing faults.
  • the apparatus 600 may be the OSS in any of the above embodiments.
  • the apparatus 600 includes at least one processor 601 , internal connections 602 , memory 603 and at least one transceiver 604 .
  • The apparatus 600 is an apparatus with a hardware structure, which can be used to implement the functional modules in the apparatus 400 described in FIG. 4.
  • For example, the processing unit 402 in the apparatus 400 shown in FIG. 4 can be implemented by the at least one processor 601 calling the code in the memory 603, and the receiving unit 401 and the sending unit 403 in the apparatus 400 shown in FIG. 4 can be implemented by the transceiver 604.
  • the apparatus 600 may also be used to implement the functions of the OSS in any of the foregoing embodiments.
  • The processor 601 may be a general-purpose central processing unit (CPU), a network processor (NP), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits used to control the execution of the program of this application.
  • the internal connection 602 described above may include a path to transfer information between the above described components.
  • the internal connection 602 is a single board or a bus or the like.
  • the above transceiver 604 is used to communicate with other devices or communication networks.
  • The memory 603 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can carry or store desired program code in the form of instructions or data structures and can be accessed by a computer, but is not limited thereto.
  • the memory can exist independently and be connected to the processor through a bus.
  • the memory can also be integrated with the processor.
  • the memory 603 is used for storing the application code for executing the solution of the present application, and the execution is controlled by the processor 601 .
  • the processor 601 is used to execute the application program code stored in the memory 603 and cooperate with at least one transceiver 604, so that the device 600 can realize the functions in the method of the present patent.
  • the processor 601 may include one or more CPUs, such as CPU0 and CPU1 in FIG. 6 .
  • the apparatus 600 may include multiple processors, such as the processor 601 and the processor 607 in FIG. 6 .
  • Each of these processors can be a single-core (single-CPU) processor or a multi-core (multi-CPU) processor.
  • a processor herein may refer to one or more devices, circuits, and/or processing cores for processing data (eg, computer program instructions).
  • an embodiment of the present application provides a schematic diagram of an apparatus 700 for processing faults.
  • the apparatus 700 may be the EMS in any of the above embodiments.
  • the apparatus 700 includes at least one processor 701 , internal connections 702 , memory 703 and at least one transceiver 704 .
  • The apparatus 700 is an apparatus with a hardware structure, which can be used to implement the functional modules in the apparatus 500 described in FIG. 5.
  • For example, the processing unit 501 in the apparatus 500 shown in FIG. 5 can be implemented by the at least one processor 701 calling the code in the memory 703, and the sending unit 502 and the receiving unit 503 in the apparatus 500 shown in FIG. 5 can be implemented by the transceiver 704.
  • the apparatus 700 may also be used to implement the functions of the EMS in any of the foregoing embodiments.
  • The processor 701 may be a general-purpose central processing unit (CPU), a network processor (NP), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits used to control the execution of the program of this application.
  • the internal connection 702 described above may include a path to transfer information between the aforementioned components.
  • the internal connection 702 is a single board or a bus or the like.
  • the above transceiver 704 is used to communicate with other devices or communication networks.
  • The memory 703 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can carry or store desired program code in the form of instructions or data structures and can be accessed by a computer, but is not limited thereto.
  • the memory can exist independently and be connected to the processor through a bus.
  • the memory can also be integrated with the processor.
  • the memory 703 is used for storing the application program code for executing the solution of the present application, and the execution is controlled by the processor 701.
  • the processor 701 is configured to execute the application program code stored in the memory 703, and cooperate with at least one transceiver 704, so that the apparatus 700 realizes the functions in the method of the present patent.
  • the processor 701 may include one or more CPUs, such as CPU0 and CPU1 in FIG. 7 .
  • the apparatus 700 may include multiple processors, such as the processor 701 and the processor 707 in FIG. 7 .
  • Each of these processors can be a single-core (single-CPU) processor or a multi-core (multi-CPU) processor.
  • a processor herein may refer to one or more devices, circuits, and/or processing cores for processing data (eg, computer program instructions).
  • An embodiment of the present application provides a system 800 for processing faults, where the system 800 includes the apparatus 400 described in FIG. 4 and the apparatus 500 described in FIG. 5, or the system 800 includes the apparatus 600 described in FIG. 6 and the apparatus 700 described in FIG. 7.
  • the apparatus 400 shown in FIG. 4 or the apparatus 600 shown in FIG. 6 may be OSS801, and the apparatus 500 shown in FIG. 5 or the apparatus 700 shown in FIG. 7 may be EMS802.

Abstract

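This application discloses a method, apparatus, and system for handling faults, and belongs to the field of communications. The method includes: receiving fault information, where the fault information includes a type information set, a first relationship set, root cause alarm information of a fault, handling suggestion information, and time information of multiple abnormal events, and the first relationship set includes the relationships between the multiple abnormal events; aggregating fault information that includes the same type information set, root cause alarm information, and handling suggestion information into a fault information set, intersecting the first relationship sets of all pieces of fault information in the fault information set to obtain a second relationship set, and determining a time length based on the time information in each piece of fault information; and sending a fault handling rule, where the fault handling rule includes the second relationship set, the time length, and the type information set, root cause alarm information, and handling suggestion information of first fault information. This application can improve the efficiency and timeliness of configuring fault handling rules.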

Description

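Method, Apparatus, and System for Handling Faults
Cross-Reference to Related Applications
This application claims priority to Chinese Patent Application No. 202011281081.9, filed on November 16, 2020 and entitled "Method, Apparatus, and System for Handling Faults", which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of communications, and in particular, to a method, apparatus, and system for handling faults.
Background
A communication network usually includes multiple network elements, and a network element may fail during operation, so handling faults is an important task in operating and maintaining a communication network.
To handle faults in network elements, an element management system (EMS) can be deployed in the communication network, and the EMS can communicate with at least one network element. A technician can manually configure fault handling rules for the EMS, so that the EMS can use the rules to handle faults that occur in the at least one network element.
In the process of implementing this application, the inventor found that the prior art has at least the following problem:
In an EMS, fault handling rules can only be configured manually and separately for each EMS based on experience, and the rules configured in each EMS cannot be learned or updated. As a result, configuring fault handling rules for an EMS is both inefficient and not timely.
Summary
This application provides a method, apparatus, and system for handling faults, to improve the efficiency and timeliness of configuring fault handling rules. The technical solutions are as follows: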
第一方面,本申请提供了一种处理故障的方法,在所述方法中:运营支撑系统OSS接收至少一个网元管理系统EMS发送的故障信息,该故障信息包括类型信息集合、第一关系集合、故障的根因告警信息、针对该故障的处理建议信息和由该故障引起的多个异常事件的时间信息,该类型信息集合包括该多个异常事件的类型信息,第一关系集合包括该多个异常事件之间的关系。OSS将包括相同类型信息集合、根因告警信息和处理建议信息的故障信息聚合成故障信息集合,将该故障信息集合中的每条故障信息中的第一关系集合取交集得到第二关系集合,以及根据每条故障信息中的时间信息确定时间长度,该时间长度是产生该故障引起的异常事件的持续时长。OSS向至少一个EMS发送故障处理规则,该故障处理规则包括第二关系集合、该时间长度、第一故障信息中的类型信息集合、根因告警信息和处理建议信息,第一故障信息是该故障信息集合中的任意一条故障信息,该故障处理规则用于指示至少一个EMS处理该故障。
由于EMS发送的故障信息包括第一关系集合和多个异常事件的时间信息,这样OSS可以将故障信息集合中的各故障信息包括的第一关系集合取交集得到第二关系集合,以及基于故障信息集合中的各故障信息包括的时间信息确定时间长度,从而可以得到一条包括第二关系集合和该时间长度等内容的故障处理规则。由于OSS可以接收不同EMS发送的故障信息,并基于各EMS的故障信息生成故障处理规则,丰富了生成故障处理所需的数据,不仅使得生成的故障处理规则具有可学习性和更新性,还提高生成故障处理规则的效率,OSS向至少一个EMS发送该故障处理规则,从而提高在EMS中配置故障处理规则的效率和实时性。
在一种可能的实现方式中,该多个异常事件之间的关系使用至少一条类型关系数据表示,每条类型关系数据包括该多个异常事件中的两个异常事件的类型信息和该两个异常事件的类型信息之间的关系,该两个异常事件所在对象之间存在该关系。由于第一关系集合中的异常事件之间的关系用类型关系数据来表示,这样OSS直接可以对故障信息集合中的各故障信息包括的第一关系集合取交集,从而提高生成故障处理规则的效率。
在另一种可能的实现方式中,该故障信息还包括该多个异常事件中的每个异常事件所在对象的对象信息;该多个异常事件之间的关系使用至少一条对象关系数据表示,每条对象关系数据包括存在关系的两个对象的对象信息和该关系,每个异常事件所在的对象包括该两个对象。OSS根据该至少一条对象关系数据、每个异常事件的对象信息和类型信息,获取该故障信息对应的类型关系数据集合,该类型关系数据集合包括至少一条类型关系数据,每条类型关系数据包括该多个异常事件中的两个异常事件的类型信息和该两个异常事件的类型信息之间的关系,该两个异常事件所在对象之间存在该关系。OSS将该故障信息集合中的每条故障信息对应的类型关系数据集合取交集得到第二关系集合。
由于第一关系集合中的异常事件之间的关系用至少一条对象关系数据来表示,以及故障信息还包括每个异常事件的对象信息,这样OSS利用该至少一条对象关系数据、每个异常事件的对象信息和类型信息,不仅可以生成故障处理规则,还可以基于该至少一条对象关系数据和每个异常事件的对象信息实现其他应用。例如,OSS可以基于该至少一条对象关系数据生成对象拓扑图,或者,基于该至少一条对象关系数据和每个异常事件的对象信息确定需要维修的对象。
在另一种可能的实现方式中,多个异常事件的时间信息是该多个异常事件的产生时间的时间跨度。OSS从每条故障信息中的时间跨度中选择最大时间跨度作为该时间长度。由于故障信息中的时间信息是一个时间跨度,这样可以减小故障信息的数据量,另外,OSS选择最大的时间跨度作为该时间长度,这样使该时间长度能够覆盖由该故障引起的各异常事件。
在另一种可能的实现方式中,该多个异常事件的时间信息是该多个异常事件中的每个异常事件的产生时间。OSS根据每条故障信息包括的时间信息,分别获取每条故障信息对应的异常事件的产生时间的时间跨度。OSS从获取的时间跨度中选择最大时间跨度作为该时间长度。由于该故障信息中包括每个异常事件的产生时间,这样OSS基于各故障信息中的异常事件的产生时间,不仅可以得出故障处理规则中的时间长度,还可以实现其他应用。例如,可 以显示故障信息中的异常事件的产生时间,以给运维人员观看,或者,基于故障信息中的异常事件的产生时间确定EMS做根因定位的过程和/或确定需要维修的对象。
在另一种可能的实现方式中,该多个异常事件的类型包括告警类型、性能越限类型和/或网元异常日志类型。
在另一种可能的实现方式中,OSS基于该至少一条对象关系数据生成对象拓扑图,显示该对象拓扑图,以供运维人员查看。
第二方面,本申请提供了一种处理故障的方法,在所述方法中,网元管理系统EMS获取EMS所管理的至少一个网元上报的由第一故障引起的多个第一异常事件信息,以及获取第一故障的根因告警信息和处理建议信息,第一异常事件信息包括第一异常事件的类型信息、产生时间和所在对象的对象信息。EMS基于每个第一异常事件的对象信息获取第一关系集合,第一关系集合包括该多个第一异常事件之间的关系。EMS向运营支撑系统OSS发送第一故障信息,第一故障信息包括类型信息集合、第一时间信息、第一关系集合、该根因告警信息和该处理建议信息,类型信息集合包括每个第一异常事件的类型信息,第一时间信息包括该多个第一异常事件的产生时间或该多个第一异常事件的产生时间的时间跨度,第一故障信息用于OSS生成第一故障处理规则,第一故障处理规则用于指示接收第一故障处理规则的至少一个EMS处理第一故障。
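The first fault handling rule includes a second relationship set, a first time length, and the type information set, root cause alarm information, and handling suggestion information of any piece of first fault information in a first fault information set. The first fault information set includes multiple pieces of first fault information received by the OSS, each of which includes the same type information set, root cause alarm information, and handling suggestion information. The second relationship set is the intersection of the first relationship sets in the multiple pieces of first fault information received by the OSS. The first time length is obtained by the OSS based on the first time information in the multiple pieces of first fault information, and is the duration over which the first abnormal events caused by the first fault are generated.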
由于EMS向OSS发送的故障信息包括第一关系集合和多个异常事件的时间信息,这样OSS可以将第一故障信息集合中的各第一故障信息包括的第一关系集合取交集得到第二关系集合,以及基于第一故障信息集合中的各第一故障信息包括的时间信息确定第一时间长度,从而可以得到一条包括第二关系集合和第一时间长度等内容的故障处理规则。由于OSS可以接收不同EMS发送的第一故障信息,并基于各EMS的第一故障信息生成故障处理规则,丰富了生成故障处理所需的数据,不仅使得生成的故障处理规则具有可学习性和更新性,还提高生成故障处理规则的效率,OSS向EMS发送该故障处理规则,从而提高在EMS中配置故障处理规则的效率和实时性。
在一种可能的实现方式中,该多个第一异常事件之间的关系使用至少一条类型关系数据表示,每条类型关系数据包括该多个第一异常事件中的两个第一异常事件的类型信息和该两个第一异常事件的类型信息之间的关系,该两个第一异常事件所在对象之间存在该关系,EMS包括网络拓扑图。EMS基于该网络拓扑图和/或每个第一异常事件的对象信息和类型信息,获取第一关系集合。由于EMS获取的第一关系集合中的异常事件之间的关系用类型关系数据来 表示,这样OSS对第一故障信息集合中的各故障信息包括的第一关系集合取交集,从而提高生成故障处理规则的效率。
在另一种可能的实现方式中,该多个第一异常事件之间的关系使用至少一条对象关系数据表示,每条对象关系数据包括存在关系的两个对象的对象信息和该关系,该多个第一异常事件中的每个第一异常事件所在的对象包括该两个对象,该EMS包括网络拓扑图。EMS基于该网络拓扑图和/或每个第一异常事件的对象信息,获取第一关系集合。由于EMS获取的第一关系集合包括至少一条对象关系数据,这样OSS利用第一关系集合,不仅可以生成故障处理规则,还可以基于第一关系集合实现其他应用。例如,OSS可以基于第一关系集合生成对象拓扑图,或者,基于第一关系集合确定需要维修的对象。
在另一种可能的实现方式中,第一故障信息还包括每个第一异常事件的对象信息。这样OSS利用第一关系集合、每个异常事件的对象信息和类型信息,可以生成故障处理规则。
在另一种可能的实现方式中,该多个第一异常事件的类型包括告警类型、性能越限类型和/或网元异常日志类型。
在另一种可能的实现方式中,EMS接收第一故障处理规则,基于第一故障处理规则获取该至少一个网元上报的由第二故障引起的多个第二异常事件信息,第一故障和第二故障是同类型故障,第二异常事件信息包括第二异常事件的类型信息、产生时间和所在对象的对象信息。EMS基于每个第二异常事件的对象信息获取第三关系集合,第三关系集合包括该多个第二异常事件之间的关系。EMS向OSS发送第二故障信息,第二故障信息包括第二时间信息、第三关系集合、该类型信息集合、该根因告警信息和该处理建议信息,第二时间信息包括该多个第二异常事件的产生时间或该多个第二异常事件的产生时间的时间跨度。
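The OSS can generate a second fault handling rule based on the second fault information. The second fault handling rule includes a fourth relationship set, a second time length, the type information set, the root cause alarm information, and the handling suggestion information. The fourth relationship set is the intersection of the third relationship sets in multiple pieces of second fault information received by the OSS, and each of the multiple pieces of second fault information includes the type information set, the root cause alarm information, and the handling suggestion information. The second time length is obtained by the OSS based on the second time information in the multiple pieces of second fault information, and is the duration over which the second abnormal events caused by the second fault are generated. The EMS receives the second fault handling rule and updates the first fault handling rule to the second fault handling rule. In this way, the OSS can generate a fault handling rule based on the second fault information from different EMSs and send it to the EMS, so that the EMS can update its fault handling rule in time, which keeps the rule current and improves the efficiency of updating.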
第三方面,本申请提供了一种处理故障的装置,用于执行第一方面或第一方面的任意一种可能的实现方式中的由OSS执行的方法。具体地,所述装置包括用于执行第一方面或第一方面的任意一种可能的实现方式中的由OSS执行的方法的单元。
第四方面,本申请提供了一种处理故障的装置,用于执行第二方面或第二方面的任意一 种可能的实现方式中的由EMS执行的方法。具体地,所述装置包括用于执行第二方面或第二方面的任意一种可能的实现方式中的由EMS执行的方法的单元。
第五方面,本申请提供了一种处理故障的装置,所述装置包括收发器、处理器和存储器。其中,所述收发器、所述处理器以及所述存储器之间可以通过内部连接相连。所述存储器用于存储程序,所述处理器用于执行所述存储器中的程序以及配合收发器,使得所述装置完成第一方面或第一方面的任意可能的实现方式中的由OSS执行的方法。
第六方面,本申请提供了一种处理故障的装置,所述装置包括收发器、处理器和存储器。其中,所述收发器、所述处理器以及所述存储器之间可以通过内部连接相连。所述存储器用于存储程序,所述处理器用于执行所述存储器中的程序以及配合收发器,使得所述装置完成第二方面或第二方面的任意可能的实现方式中的由EMS执行的方法。
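In a seventh aspect, this application provides a computer program product, where the computer program product includes a computer program stored in a computer-readable storage medium, and the computer program is loaded by a device to implement the instructions of the method according to the first aspect, the second aspect, any possible implementation of the first aspect, or any possible implementation of the second aspect.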
第八方面,本申请提供了一种计算机可读存储介质,用于存储计算机程序,所述计算机程序通过设备进行加载来执行上述第一方面、第二方面、第一方面任意可能的实现方式或第二方面任意可能的实现方式的方法的指令。
第九方面,本申请提供了一种处理故障的系统,所述系统包括第三方面所述的装置和第四方面所述的装置,或者,所述系统包括第五方面所述的装置和第六方面所述的装置。
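Brief Description of the Drawings
FIG. 1 is a schematic diagram of a network architecture according to an embodiment of this application;
FIG. 2 is a flowchart of a method for handling faults according to an embodiment of this application;
FIG. 3 is a flowchart of another method for handling faults according to an embodiment of this application;
FIG. 4 is a schematic structural diagram of an apparatus for handling faults according to an embodiment of this application;
FIG. 5 is a schematic structural diagram of another apparatus for handling faults according to an embodiment of this application;
FIG. 6 is a schematic structural diagram of another apparatus for handling faults according to an embodiment of this application;
FIG. 7 is a schematic structural diagram of another apparatus for handling faults according to an embodiment of this application;
FIG. 8 is a schematic structural diagram of a system for handling faults according to an embodiment of this application.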
具体实施方式
下面将结合附图对本申请实施方式作进一步地详细描述。
参见图1,本申请实施例提供了一种网络架构,包括:
运营支撑系统(operations support system,OSS)、多个EMS和多个网元,每个EMS与 OSS通信,以及每个EMS与一个或多个网元通信。
OSS用于生成至少一个故障处理规则,向与该OSS通信的EMS发送该至少一个故障处理规则。
其中,对于该至少一个故障处理规则中的每个故障处理规则,该故障处理规则用于处理网元产生的故障,每个故障处理规则处理的故障可能不同。
对于任一个用于处理故障的故障处理规则,该故障处理规则包括类型信息集合、第一类型关系数据集合、根因告警信息、处理建议信息和时间长度。
其中,类型信息集合包括多个类型信息,该多个类型信息是该故障引起的多个异常事件的类型信息。该故障引起的各异常事件的产生时间的时间跨度小于或等于该时间长度。第一类型关系数据集合包括至少一条类型关系数据,每条类型关系数据包括该类型信息集合中的两个类型信息和该两个类型信息之间的关系。根因告警信息包括根因告警事件的类型信息,根因告警事件是该故障直接引起的异常事件,根因告警信息可用于反映出现该故障的原因。处理建议信息用于指示EMS或OSS如何对该故障进行处理。
对于每个EMS,以及对于与该EMS通信的一个或多个网元,该EMS用于根据该至少一个故障处理规则对该一个或多个网元出现的故障进行处理,以使出现故障的网元恢复正常。
其中,对于该多个网元中的任意网元,该网元在出现故障时可能产生异常事件,与该网元相关联的其他网元也可能受该故障影响产生异常事件。例如对于正在通信的两个终端,该两个终端之间的业务路径经过多个网元,该业务路径经过的多个网元相互关联,其中一个网元故障并产生异常事件,其他网元也可能受影响,也可能产生异常事件。
对于任一个出现故障的网元,为了便于说明,将该出现故障的网元称为第一网元。在第一网元出现故障时,第一网元产生并上报至少一个异常事件。该至少一个异常事件均是由该故障直接或间接引起的。
第一网元出现故障可能是第一网元中的某个部件出现故障。第一网元产生的异常事件包括至少一个告警事件,该至少一个告警事件包括由该故障直接引起的告警事件,还可能包括由该故障间接引起的告警事件。由该故障直接引起的告警事件又可称为根因告警事件,根因告警事件可用于反映该故障出现的原因。
可选的,第一网元产生的异常事件还可能包括至少一个性能越限事件和/或至少一个网元异常日志事件等。其中,性能越限事件是第一网元的性能参数的参数值超过参数值阈值时产生的与该性能参数相对应的事件。所谓性能参数的参数值超过参数值阈值是指该性能参数的参数值大于该参数值阈值,或者,该性能参数的参数值小于该参数值阈值。
例如,假设第一网元的某个端口出现故障,则第一网元直接产生的告警事件为端口告警事件,即该端口告警事件为该故障直接引起的告警事件,因此该端口告警事件是根因告警事件。由于该端口故障,可能导致经过该端口的链路无法承载业务,第一网元可能还产生链路故障告警事件。该链路故障告警事件就是由该故障间接引起的告警事件。
由于第一网元无法在该故障的端口上收发数据,导致该端口的传输速率急剧下降,在该端口的传输速率低于速率阈值时,第一网元可能产生传输速率越限事件,该传输速率越限事件是一种性能越限事件。
在该端口出现故障后,第一网元可能产生该故障的日志,第一网元产生的日志可能包括端口异常日志,即第一网元可能产生端口异常日志事件,该端口异常日志事件是网元异常日 志事件。所以在第一网元中的端口出现故障时,第一网元产生的异常事件包括端口告警事件、链路故障告警事件、传输速率越限事件和端口异常日志事件。
需要说明的是:在该端口出现故障后,第一网元可能产生一条日志,也可能产生多条日志。第一网元产生的每条日志可能对应一个网元异常日志事件,所以第一网元可能产生一个或多个网元异常日志事件。上述列举了第一网元产生端口异常日志事件,第一网元也可能不会产生其他网元异常日志事件,或者,第一网元还可能产生其他网元异常日志事件。例如第一网元还可能产生链路异常日志,即第一网元还可能产生链路异常日志事件。其中,需要说明的是:经过该端口的链路可能有一条或多条,在经过该端口的链路为多条的情况下,第一网元可能产生一个链路异常日志事件,该链路异常日志事件与该多条链路相对应,或者,第一网元也可能产生多个链路异常日志事件,每个链路异常日志事件与一条链路相对应。
对于与第一网元通信的其他网元,该其他网元也可能受第一网元出现的故障影响,也可能产生至少一个异常事件。与第一网元通信,并受第一网元出现的故障影响的其他网元可以称为第二网元。
例如,仍以上述第一网元中的端口出现故障为例,第二网元中的端口与第一网元的该故障端口之间存在链路,导致第二网元从该链路上无法接收到第一网元发送的数据,第二网元在检测到第一网元中的该故障端口阻塞,第二网元产生远端端口阻塞告警事件。由于第二网元无法通过该链路接收第一网元发送的数据,第二网元可能会产生记录该情况的日志,第二网元产生的日志可能包括链路异常日志,即第二网元产生链路异常日志事件。
其中,同第一网元一样,第二网元可能产生一条日志,或者,可能产生多条日志。第二网元产生的每条日志可能对应一个网元异常日志事件,所以第二网元可能产生一个或多个网元异常日志事件。例如,第二网元中的端口与第一网元的该故障端口之间可能存在多条链路,第二网元可能产生与该多条链路相对应的一个链路异常日志事件,或者,产生与每条链路相对应的链路异常日志事件。
再例如,在第一网元的风扇故障时,第一网元产生风扇故障告警事件,该风扇故障告警事件是该风扇故障直接引起的异常事件,为根因告警事件。由于第一网元的风扇故障,导致第一网元的单板工作温度上升,在单板工作温度超过温度阈值时,第一网元产生工作温度越限事件(越限即超过限额,可以是大于最大限额或小于最小限额)。该工作温度越限事件是该风扇故障间接引起的异常事件。第一网元还可能产生该风扇故障的日志,使得第一网元可能产生风扇异常日志事件,该风扇异常日志事件也是该风扇故障间接引起的异常事件。
对于第二网元中与第一网元中的该单板相连(即通信连接)的端口,第二网元通过该端口接收数据的误码率变高。在误码率高于误码率阈值时,第二网元产生误码率越限事件。
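In this example, in addition to the bit error rate threshold-crossing event, the second network element may also generate other abnormal events, such as alarm events and/or network element abnormal log events. For example, in implementation, the second network element may also generate a bit error rate abnormal log, so that the second network element generates a bit error rate abnormal log event.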
对于每个网元,该网元在产生异常事件后,向与该网元通信的EMS发送产生的异常事件的事件信息。其中,异常事件的事件信息包括该异常事件的类型信息、该异常事件的产生时间和该异常事件所在的对象的对象信息。
该异常事件的类型信息包括该异常事件属于的事件类型(比如告警类型,性能越限类型或网元异常日志类型等)和该异常事件属于该事件类型的子类型。
该异常事件的事件信息还可以包括其他信息。例如,网元异常日志事件的事件信息还可以包括与该网元异常日志事件相对应的日志内容。
告警类型的异常事件的事件名称为告警事件。告警类型包括至少一个子类型,例如属于告警类型的子类型包括端口告警类型、链路故障告警类型、远端端口阻塞告警类型和/或风扇故障告警类型等。告警类型的异常事件包括与告警类型的子类型对应事件,例如端口告警事件(对应于端口告警类型)、链路故障告警事件(对应于链路故障告警类型)、远端端口阻塞告警事件(对应于远端端口阻塞告警类型)和/或风扇故障告警事件(对应于风扇故障告警类型)等。
性能越限类型的异常事件的事件名称为性能越限事件。性能越限类型包括至少一个子类型,例如属于性能越限类型的子类型包括传输速率越限类型和/或误码率越限类型等。性能越限类型的性能越限事件包括与性能越限类型的子类型对应的事件,例如传输速率越限事件(对应于传输速率越限类型)和/或误码率越限事件(对应于误码率越限类型)等。
网元异常日志类型的异常事件的事件名称为网元异常日志事件。网元异常日志类型包括至少一个子类型,例如属于网元异常日志类型的子类型包括链路异常日志类型、误码率异常日志类型和/或风扇异常日志类型等。网元异常日志类型的异常事件包括与网元异常日志类型的子类型对应事件,例如链路异常日志事件(对应于链路异常日志类型)、误码率异常日志事件(对应于误码率异常日志类型)和/或风扇异常日志事件(对应于风扇异常日志类型)等。
异常事件所在的对象是产生该异常事件的对象,该对象的对象信息包括该对象的对象标识和/或该对象所在网元的网元标识,还可能包括该对象的描述信息。该对象可能是一个网元,网元中的部件或网元上建立的链路在该网元侧的端点等。
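For example, for the port alarm event generated by the first network element described above, the object where the port alarm event is located is the port in the first network element, and the object information of the object includes the port identifier of the port and the network element identifier of the first network element. The object information may also include description information of the port, which can describe the position of the port in the first network element; for example, the description information includes the slot position of the port in the first network element.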
对于EMS,该EMS接收与其通信的至少一个网元发送的多个异常事件的事件信息,基于该多个异常事件的事件信息生成故障信息,向OSS发送故障信息,以使OSS可以基于故障信息生成故障处理规则。其中,EMS生成故障信息以及OSS生成故障处理规则的详细过程,将在后续实施例进行详细说明。
参见图2,本申请实施例提供了一种处理故障的方法,该方法可以应用图1所示的网络架构,在该方法中,EMS生成故障信息,该故障信息包括类型信息集合、对象关系数据集合、根因告警信息和处理建议信息,OSS接收EMS发送的故障信息,并基于接收的故障信息生成故障处理规则。该方法包括:
步骤201:EMS接收至少一个网元发送的多个异常事件的事件信息。
该至少一个网元中的每个网元与EMS通信。例如,该每个网元与EMS之间存在网络连接。
对于该至少一个网元中的任一个网元,为了便于说明称该网元为第一网元,在第一网元出现故障时,第一网元可能产生至少一个异常事件,以及第二网元也可能产生至少一个异常事件,第二网元是该至少一个网元中除第一网元之外的其他网元。
第一网元向EMS发送第一网元产生的每个异常事件的事件信息,第二网元也向EMS发送第二网元产生的每个异常事件的事件信息。
EMS接收多个异常事件的事件信息,该多个异常事件包括至少一个告警事件。对于任一个告警事件,该告警事件的事件信息包括该告警事件的类型信息,该告警事件的产生时间和该告警事件所在的对象的对象信息。该告警事件的类型信息包括告警类型和该告警事件属于的告警类型的子类型,该告警事件所在的对象是产生该告警事件的对象,该对象的对象信息包括该对象的对象标识和该对象所在网元的网元标识,还可能包括该对象的描述信息。
可选的,该告警事件的事件信息还可能包括该告警事件的事件标识和/或告警级别等信息,该事件标识可以在EMS中标识该告警事件。
例如,仍以图1所示实施例中列举的第一网元的端口出现故障为例,在第一网元的端口故障时,导致第一网元产生的告警事件包括端口告警事件和链路故障告警事件,第二网元产生的告警事件包括远端端口阻塞告警事件。因此参见下表1,EMS接收的告警事件的事件信息包括端口告警事件的事件信息、链路故障告警事件的事件信息和远端端口阻塞告警事件的事件信息。在表1中端口标识Port1对应的端口P1是第一网元中的故障端口。端口标识Port2对应的端口P2是第二网元中的与该故障端口P1相连的端口。第一链路端点是链路1在第一网元侧的端点,链路1是在第一网元的该故障端口P1和第二网元中的该端口P2之间建立的链路,链路1的链路标识Link1,第一链路端点的端点标识为Link1-1且位于该故障端口P1中。端口告警事件所在的对象是端口P1,链路故障告警事件所在的对象是第一链路端点,远端端口阻塞事件所在的对象是端口P2。
表1
Figure PCTCN2021103583-appb-000001
该多个异常事件还可能包括至少一个性能越限事件,对于任一个性能越限事件,该性能越限事件的事件信息包括该性能越限事件的类型信息,该性能越限事件的产生时间和该性能越限事件所在的对象的对象信息。该性能越限事件的类型信息包括性能越限类型和该性能越限事件属于的性能越限类型的子类型。可选的,该性能越限事件的事件信息还可能包括该性 能越限事件的事件标识和/或性能参数的参数值等信息,该事件标识可以在EMS中标识该性能越限事件。
例如,在第一网元的端口故障时,导致第一网元产生的性能越限事件包括传输速率越限事件。因此参见下表2,EMS接收的性能越限事件的事件信息包括传输速率越限事件的事件信息,传输速率越限事件所在的对象也为端口P1。
表2
Figure PCTCN2021103583-appb-000002
该多个异常事件还可能包括至少一个网元异常日志事件,对于任一个网元异常日志事件,该网元异常日志事件的事件信息包括该网元异常日志事件的类型信息,该网元异常日志事件的产生时间和该网元异常日志事件所在的对象的对象信息。该网元异常日志事件的类型信息包括网元异常日志类型和该网元异常日志事件属于的网元异常日志类型的子类型。可选的,该网元异常日志事件的事件信息还可能包括该网元异常日志事件的事件标识和/或日志内容等信息,该事件标识可以在EMS中标识该网元异常日志事件。
例如,假设在第一网元的端口故障时,第一网元产生的网元异常日志事件包括端口异常日志事件,第二网元产生的网元异常日志事件包括链路异常日志事件。因此参见下表3,EMS接收的网元异常日志事件的事件信息包括端口异常日志事件的事件信息和链路异常日志事件的事件信息,端口异常日志事件所在的对象为端口P1,链路异常日志事件所在的对象为第二链路端点,第二链路端点是链路1在第二网元侧的端点,第二链路端点位于第二网元的端口P2中。
表3
Figure PCTCN2021103583-appb-000003
步骤202:EMS根据多个异常事件的事件信息和EMS中的至少一个故障处理规则,确定同一故障引起的至少一个异常事件的事件信息。
其中,EMS可以在接收到异常事件的事件信息时就开始执行本步骤,或者,定期地执行本步骤。EMS在本步骤中使用的该多个异常事件的事件信息包括最新接收的异常事件的事件信息,还可能包括在之前生成故障信息时被排除的异常事件的事件信息,但该多个异常事件中的每个异常事件的产生时间与当前时间之间的时间差值小于指定的时间差阈值。
EMS存储有网络拓扑图和至少一个故障处理规则,该至少一个故障处理规则可能包括EMS接收OSS发送的故障处理规则和/或在EMS中配置的故障处理规则。OSS发送的故障处理规则是OSS生成的。
在步骤202中,EMS可以通过如下2021至2025的操作确定同一故障引起的至少一个异常事件的事件信息。该2021至2025的操作,分别为:
2021:EMS从该至少一个故障处理规则中选择一个故障处理规则作为目标故障处理规则。
目标故障处理规则包括类型信息集合、第一类型关系数据集合、根因告警信息、处理建议信息和时间长度。该类型信息集合包括至少一个子类型集合,该至少一个子类型集合包括告警类型对应的告警类型集合,该告警类型集合包括属于告警类型的至少一个子类型。
可选的,该至少一个子类型集合还可能包括性能越限类型对应的性能越限类型集合和/或网元异常日志类型对应的网元异常日志类型集合。该性能越限类型集合包括属于性能越限类型的至少一个子类型,以及该网元异常日志类型集合包括属于网元异常日志类型的至少一个子类型。
第一类型关系数据集合包括至少一条类型关系数据,每条类型关系数据包括两个类型信息和该两个类型信息之间的关系。
可选的,在实现时,该类型关系数据具体包括两个子类型和该两个子类型之间的关系,该两个子类型可能属于同一事件类型,或者,属于不同事件类型。比如,<端口告警类型,链路故障告警类型,包含关系>;表示,同属于告警类型的端口告警类型和链路故障告警类型,以及这两个告警类型的关系为包含关系。再比如,<端口告警类型,传输速率越限类型,同网元关系>:表示,端口告警类型和传输速率越限类型以及这两个子类型的关系为同网元关系,端口告警类型和传输速率越限类型属于不同事件类型。
根因告警信息包括根因告警事件的类型信息。可选的,根因告警信息包括根因告警事件的子类型。
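The following is an example of a fault handling rule. For example, the EMS stores the fault handling rule shown in Table 4. Referring to Table 4, the type information set in the fault handling rule includes three subtype sets: an alarm type set, a performance threshold-crossing type set, and a network element abnormal log type set. The subtypes in the alarm type set are the port alarm type, the link fault alarm type, and the remote port blocking alarm type. The subtype in the performance threshold-crossing type set is the transmission rate threshold-crossing type. The subtypes in the network element abnormal log type set are the port abnormal log type and the link abnormal log type. The type relationship data in the first type relationship data set of the fault handling rule is <port alarm type, link fault alarm type, containment relationship>, <port alarm type, transmission rate threshold-crossing type, same-network-element relationship>, and so on. The root cause alarm information in the fault handling rule includes the port alarm type, which is the subtype of the root cause alarm event; the root cause alarm event is a port alarm event. The handling suggestion information in the fault handling rule is to restart the port where the root cause alarm event is located. The time length in the fault handling rule is 50 seconds.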
表4
Figure PCTCN2021103583-appb-000004
2022:EMS根据该多个异常事件中的每个异常事件的事件信息和目标故障处理规则包括的类型信息集合确定第一集合,第一集合包括该类型信息集合中的每个类型信息对应的异常事件。
该多个异常事件中的每个异常事件的事件信息包括类型信息,目标故障处理规则包括类型信息集合。在2022中,确定第一集合的过程可以为:确定该多个异常事件中是否存在该类型信息集合包括的每个类型信息对应的异常事件,如果存在,则获取第一集合,第一集合包括该类型信息集合中的每个类型信息对应的异常事件;如果不存在,则返回2021重新选择一个故障处理规则作为目标故障处理规则。
由于该类型信息集合包括至少一个子类型集合,每个子类型集合包括至少一个子类型,该多个异常事件中的每个异常事件的类型信息也包括子类型,所以在实现时,确定第一集合的过程可以为:确定该多个异常事件中是否存在该类型信息集合包括的每个子类型对应的异常事件,如果存在,则获取第一集合,第一集合包括该类型信息集合中的每个子类型对应的异常事件。
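Some of the multiple abnormal events may not belong to the first set. Abnormal events that do not belong to the first set can be used the next time fault information is generated.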
例如,EMS接收到上述表1、表2和表3所示的六个异常事件,该六个异常事件分别为端口告警事件、链路故障告警事件、远端端口阻塞告警事件、传输速率越限事件、端口异常日志事件和链路异常日志事件,以及EMS存储有如表4所示的目标故障处理规则,目标故障处理规则中的类型信息集合包括六个子类型,该六个子类型为端口告警类型、链路故障告警类型、远端端口阻塞告警类型、传输速率越限类型、端口异常日志类型和链路异常日志类型。其中,表1、表2和表3所示的六个异常事件中包括端口告警类型对应的端口告警事件,链路故障告警类型对应的链路故障告警事件,远端端口阻塞告警类型对应的远端端口阻塞告警事件,传输速率越限类型对应的传输速率越限事件,端口异常日志类型对应的端口异常日志事件和链路异常日志类型对应的链路异常日志事件。因此得到的第一集合包括端口告警事件、链路故障告警事件、远端端口阻塞告警事件、传输速率越限事件、端口异常日志事件和链路异常日志事件。
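The check in operation 2022 can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Event` record, its field names, and the subtype labels are hypothetical, and the rule's type information set is flattened into a plain collection of subtype names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    subtype: str   # hypothetical subtype label, e.g. "port_alarm"
    time: float    # generation time, in seconds
    obj: str       # identifier of the object where the event occurred


def first_set(events, rule_subtypes):
    """Return the events whose subtype is named by the rule, or None when
    at least one subtype in the rule has no corresponding event."""
    wanted = set(rule_subtypes)
    matched = [e for e in events if e.subtype in wanted]
    if {e.subtype for e in matched} != wanted:
        return None  # the caller returns to 2021 and picks another rule
    return matched
```

Events that are not matched stay available for the next round of fault information generation, as the text above describes.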
2023:EMS根据第一集合和目标故障处理规则包括的时间长度确定第二集合。第二集合是第一集合的子集,第二集合包括的异常事件的产生时间的时间跨度小于或等于目标故障处理规则包括的时间长度。
其中,该时间跨度等于第二集合中的最早产生的异常事件的产生时间和最晚产生的异常事件的产生时间之间的时间差。第二集合是第一集合的子集的情况,包括第二集合属于第一集合,或者,第二集合和第一集合相同。
在步骤2023中,获取第一集合中的异常事件的产生时间的时间跨度,如果该时间跨度小于或等于目标故障处理规则包括的时间长度,将第一集合作为第二集合。如果该时间跨度大于目标故障处理规则包括的时间长度,从第一集合中去除最早产生异常事件或最晚产生的异常事件。然后,再获取第一集合中的剩余异常事件的产生时间的时间跨度,如果获取的时间跨度小于或等于目标故障处理规则包括的时间长度,将第一集合中的剩余异常事件组成第二集合。如果获取的时间跨度大于目标故障处理规则包括的时间长度,继续从第一集合剩余的异常事件中去除最早产生异常事件或最晚产生的异常事件,并继续重复上述过程,直至第一集合中剩余的异常事件的产生时间的时间跨度小于或等于目标故障处理规则中的时间长度为止。
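For example, the earliest abnormal event in the first set is the port alarm event and the latest is the link abnormal log event. Based on the generation time T1 of the port alarm event and the generation time T6 of the link abnormal log event, the time span of the generation times of the six abnormal events in the first set is T6-T1. Assuming that T6-T1 is less than or equal to 50 seconds, the first set is used as the second set, and the second set also includes the six abnormal events. That is, as shown in Table 5, the second set includes the port alarm event, the link fault alarm event, the remote port blocking alarm event, the transmission rate threshold-crossing event, the port abnormal log event, and the link abnormal log event.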
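Table 5
Abnormal events in the second set
Port alarm event
Link fault alarm event
Remote port blocking alarm event
Transmission rate threshold-crossing event
Port abnormal log event
Link abnormal log event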
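The trimming loop of operation 2023 can be sketched as follows. The text only says that the earliest or the latest event is removed in each round; the tie-break used here (drop whichever end leaves the smaller remaining span) is an assumption made for illustration, and events are represented as hypothetical `(name, generation_time)` pairs.

```python
def second_set(events, max_span):
    """events: list of (name, generation_time) pairs.
    Returns the events kept once their time span fits within max_span."""
    kept = sorted(events, key=lambda e: e[1])
    while len(kept) > 1 and kept[-1][1] - kept[0][1] > max_span:
        span_without_earliest = kept[-1][1] - kept[1][1]
        span_without_latest = kept[-2][1] - kept[0][1]
        # Assumption: remove the end whose removal shrinks the span most.
        if span_without_earliest <= span_without_latest:
            kept = kept[1:]
        else:
            kept = kept[:-1]
    return kept
```

If the returned set no longer contains an event for every type in the rule, operation 2024 sends the EMS back to 2021 to try another rule.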
2024:在第二集合包括目标故障处理规则中的类型信息集合中的每个类型信息对应的异常事件的情况下,EMS根据网络拓扑图和/或第二集合中的各异常事件的事件信息,获取第二类型关系数据集合。
其中,第二类型关系数据集合包括至少一条类型关系数据,每条类型关系数据包括两个类型信息和该两个类型信息之间的关系,该两个类型信息是第二集合中的两个异常事件的类型信息。
在第二集合没有包括目标故障处理规则中的某一个或多个类型信息对应的异常事件的情况下,返回2021重新选择一个故障处理规则作为目标故障处理规则。
在2024中,对于第二集合中的任意两个异常事件,为了便于说明,将该两个异常事件称为第一异常事件和第二异常事件,根据第一对象的对象标识和所在的网元标识,第二对象的对象标识和所在的网元标识,和/或,网络拓扑图,确定第一对象和第二对象之间是否存在至少一个指定关系。其中,第一对象是第一异常事件所在的对象,第二对象是第二异常事件所在的对象。如果第一对象和第二对象之间存在至少一个指定关系,从该至少一个指定关系中选择一个指定关系作为第一对象和第二对象之间的关系,该关系也是第一异常事件的类型信息与第二异常事件的类型信息之间的关系。从而可以得出一条类型关系数据以及一条对象关系数据,该类型关系数据包括第一异常事件的类型信息、第二异常事件的类型信息和该关系,该对象关系数据包括第一对象的对象信息、第二对象的对象信息和该关系。
可选的,该类型关系数据具体包括第一异常事件的子类型、第二异常事件的子类型和该关系,该对象关系数据具体包括第一对象的对象标识、第一对象所在网元的网元标识、第二对象的对象标识、第二对象所在网元的网元标识,以及该关系。
可选的,EMS从该至少一个指定关系中选择优先级最大或最小的一个指定关系作为第一对象和第二对象之间的关系。
可选的,指定关系可以为包含关系,承载关系,同业务关系,邻居关系或同网元关系等。在本申请实施例中定义了包含关系,承载关系,同业务关系,邻居关系和同网元关系的优先级,各关系的优先级不同。例如,接下来列举了一种各关系的优先级大小的实例,该实例为:包含关系>承载关系>同业务关系>邻居关系>同网元关系的优先级;各关系的优先级大小除了上述列举的实例外,还可能是其他实现实例,在此不再一一列举说明。
所谓包含关系是指第一对象包括第二对象或第二对象包括第一对象,例如,第一对象是单板,第二对象是单板中的端口,第一对象包括第二对象。所谓承载关系是指第一对象位于 第二对象上或第二对象位于第一对象上,例如,第一对象是物理链路,第二对象是逻辑链路,第二对象位于第一对象上。所谓同业务关系是指第一对象和第二对象用于传输相同业务。所谓邻居关系是指在第一对象和第二对象属于两个不同网元且两个网元之间直接相连。所谓同网元关系是第一对象和第二对象位于同一网元。
例如,对于第二集合中的任意两个异常事件,假设该两个异常事件为端口告警事件和链路故障告警事件,端口告警事件所在对象为端口P1,端口P1位于第一网元上,端口告警事件的对象标识为端口标识Port1;链路故障告警事件所在对象为第一链路端点,第一链路端点也位于第一网元上,链路故障告警事件的对象标识为端点标识Link1-1。所以EMS根据端口P1的对象标识Port1和所在的第一网元标识NE1,第一链路端点的对象标识Link1-1和所在的第一网元标识NE1,和/或,网络拓扑图,确定端口P1和第一链路端点之间是否存在至少一个指定关系。其中端口P1位于第一网元上,第一链路端点位于端口P1中,所以端口P1和第一链路端点之间存在包含关系和同网元关系,选择优先级最大的包含关系作为端口P1和第一链路端点之间的关系。从而可以得出一条类型关系数据以及一条对象关系数据,该类型关系数据为<端口告警类型,链路故障告警类型,包含关系>,该对象关系数据为<端口P1的对象信息,第一链路端点的对象信息,包含关系>。
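Choosing one relationship from several candidates by priority can be sketched as follows. The ordering mirrors the example given above (containment > bearer > same-service > neighbor > same-network-element); the string labels are hypothetical, and the sketch implements only the "highest priority" option.

```python
# Highest priority first, following the example ordering in the text.
PRIORITY = ["containment", "bearer", "same_service", "neighbor", "same_ne"]


def pick_relation(candidates):
    """Return the highest-priority relationship among the candidates,
    or None when the two objects have no specified relationship."""
    if not candidates:
        return None
    return min(candidates, key=PRIORITY.index)
```

For port P1 and the first link endpoint, the candidates are containment and same-network-element, so containment is chosen.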
可选的,EMS通过如下(1)至(4)操作,来确定第一对象和第二对象之间是否存在至少一个指定关系。该(1)至(4)的操作,可以为:
(1):EMS从第一异常事件的事件信息中获取第一异常事件的类型信息,第一对象的对象标识和所在的网元标识,从第二异常事件的事件信息获取第二异常事件的类型信息,第二对象的对象标识和所在的网元标识。
例如,仍以上述第二集合中包括的端口告警事件和链路故障告警事件为例,参见表1,端口告警事件的事件信息中包括端口告警类型,端口P1的端口标识Port1(端口P1为对象,端口标识Port1为对象标识),以及端口P1所在的网元标识NE1;链路故障告警事件的事件信息包括链路故障告警类型、第一链路端点的端点标识Link1-1(第一链路端点为对象,端点标识Link1-1为对象标识),以及第一链路端点所在的网元标识NE1。因此,EMS从端口告警事件的事件信息中获取端口告警事件的端口告警类型、端口P1的端口标识Port1和端口P1所在的网元标识NE1,以及从链路故障告警事件的事件信息中获取链路故障告警事件的链路故障告警类型,第一链路端点的端点标识Link1-1和第一链路端点所在的网元标识NE1。
再例如,再以上述第二集合包括的端口告警事件和链路异常日志事件为例,参见表3,链路异常日志事件的事件信息中包括链路异常日志类型,第二链路端点的端点标识Link1-2(第二链路端点为对象,端点标识Link1-2为对象标识),以及第二链路端点所在的网元标识NE2。因此,EMS从链路异常日志事件的事件信息中获取链路异常日志类型,第二链路端点的端点标识Link1-2和第二链路端点所在的网元标识NE2。
(2):EMS根据第一对象所在的网元标识和第二对象所在的网元标识,确定第一对象和第二对象是否位于同一网元,如果确定出位于同一网元,执行如下操作(3),如果确定出位于不同网元,执行如下操作(4)。
例如,EMS根据端口P1所在的网元标识NE1和第一链路端点所在的网元标识NE1,确定端口P1和第一链路端点位于第一网元,即二者位于同一网元,执行如下操作(3)。
再例如,EMS根据端口P1所在的网元标识NE1和第二链路端点所在的网元标识NE2, 确定端口P1和第二链路端点位于不同网元,执行如下操作(4)。
(3):EMS根据第一对象的对象标识和第二对象的对象标识,确定第一对象和第二对象之间存在的至少一个指定关系,该至少一个指定关系可能包括包含关系和/或同网元关系,结束。
EMS还在该至少一个指定关系中,选择优先级最大或最小的指定关系作为第一对象和第二对象之间的关系。
例如,对于上述列举的端口告警事件和链路故障告警事件,该端口告警事件所在的对象为端口P1,链路故障告警事件所在的对象为第一链路端点。根据第一链路端点的端点标识Link1-1和端口P1的端口标识Port1,确定端口P1包括第一链路端点,所以确定第一链路端点和端口P1之间存在包含关系和同网元关系,而包含关系的优先级高于同网元关系,所以确定第一链路端点和端口P1之间的关系为包含关系,如此得到一条对象关系数据,该对象关系数据为<第一链路端点的对象信息,端口P1的对象信息,包含关系>,以及得到一条类型关系数据,该类型关系数据为<端口告警类型,链路故障告警类型,包含关系>。
(4):EMS根据网络拓扑图、第一对象的对象标识和所在的网元标识,第二对象的对象标识和所在的网元标识,确定第一对象和第二对象之间是否存在至少一个指定关系,该至少一个指定关系可能包括邻居关系、同业务关系和/或承载关系。
如果第一对象和第二对象之间存在至少一个指定关系,EMS从该至少一个指定关系中,选择优先级最大或最小的指定关系作为第一对象和第二对象之间的关系。如果第一对象和第二对象之间不存在指定关系,表明第一对象和第二对象之间没有关系,以及第一异常事件的类型信息和第二异常事件的类型信息之间没有关系。
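For example, for the port alarm event and the link abnormal log event listed above, the object where the port alarm event is located is port P1, and the object where the link abnormal log event is located is the second link endpoint. Port P1 is located in the first network element, the second link endpoint is located in port P2 of the second network element, and the second link endpoint and the first link endpoint in port P1 are the two endpoints of the same link 1. Therefore, based on the network topology, the port identifier Port1 of port P1 and the network element identifier NE1 of the first network element, and the endpoint identifier Link1-2 of the second link endpoint and the network element identifier NE2 of the second network element, the EMS determines that a neighbor relationship exists between port P1 and the second link endpoint. The neighbor relationship is determined as the relationship between port P1 and the second link endpoint, which yields one piece of object relationship data, <object information of port P1, object information of the second link endpoint, neighbor relationship>, and one piece of type relationship data, <port alarm type, link abnormal log type, neighbor relationship>.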
经过2024的操作后,不仅得到第二类型关系数据集合,还可以得到对象关系数据集合。例如,对于第二集合包括的端口告警事件、链路故障告警事件、远端端口阻塞告警事件、传输速率越限事件、端口异常日志事件和链路异常日志事件。对第二集合中的任意两个异常事件均执行上述过程后,得到如下表6所示的第二类型关系数据集合和如下表7所示的对象关系数据集合。
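Table 6
Second type relationship data set
<port alarm type, link fault alarm type, containment relationship>
<port alarm type, transmission rate threshold-crossing type, same-network-element relationship>
<link abnormal log type, remote port blocking alarm type, containment relationship>
<port alarm type, remote port blocking alarm type, neighbor relationship>
<link fault alarm type, remote port blocking alarm type, neighbor relationship>
<transmission rate threshold-crossing type, remote port blocking alarm type, neighbor relationship>
<port abnormal log type, remote port blocking alarm type, neighbor relationship>
<port alarm type, link abnormal log type, neighbor relationship>
<link fault alarm type, link abnormal log type, same-service relationship>
<transmission rate threshold-crossing type, link abnormal log type, neighbor relationship>
<port abnormal log type, link abnormal log type, neighbor relationship>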
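Table 7
Object relationship data set
<object information of port P1, object information of the first link endpoint, containment relationship>
<object information of port P1, object information of port P2, neighbor relationship>
<object information of port P1, object information of the second link endpoint, neighbor relationship>
<object information of the first link endpoint, object information of port P2, neighbor relationship>
<object information of the first link endpoint, object information of the second link endpoint, same-service relationship>
<object information of port P2, object information of the second link endpoint, containment relationship>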
2025:在第二类型关系数据集合包括目标故障处理规则中的第一类型关系数据集合时,确定第二集合中的各异常事件的事件信息是同一故障引起的异常事件的事件信息。
其中,在第二类型关系数据集合不包括第一类型关系数据集合时,返回2021重新选择一个故障处理规则作为目标故障处理规则。
目标故障处理规则包括类型信息集合和第一类型关系数据集合,第二集合包括的异常事件是该类型信息集合中的每个类型信息对应的异常事件。在第二类型关系数据集合包括第一类型关系数据集合时,表示第一类型关系数据集合中的每条类型关系数据可以与第二集合中的两个异常事件匹配。
例如,表6所示的第二类型关系数据集合包括表4所示的目标故障处理规则中的第一类型关系数据集合。对于表4中的第一类型关系数据集合中的任一条类型关系数据,以类型关系数据<端口告警类型,链路故障告警类型,包含关系>为例,第二集合包括端口告警类型对应的端口告警事件和链路故障告警类型对应的链路故障告警事件,表6中所示的第二类型关系数据集合记录了该端口告警事件的端口告警类型和该链路故障告警事件的链路故障告警类型之间的关系也为包含关系,所以<端口告警类型,链路故障告警类型,包含关系>与第二集合中的端口告警事件和链路故障告警事件匹配。同理,该第一类型关系数据集合中的其他每条类型关系数据均与第二集合中的异常事件匹配,所以可以确定第二集合中的端口告警事件 的事件信息、链路故障告警事件的事件信息、远端端口阻塞告警事件的事件信息、传输速率越限事件的事件信息、端口异常日志事件的事件信息和链路异常日志事件的事件信息是同一故障引起的异常事件的事件信息。
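The containment test in operation 2025 amounts to a subset check, sketched below. Treating the two types in a piece of type relationship data as an unordered pair is an assumption (the text writes the pair in either order), and the helper names and labels are hypothetical.

```python
def type_relation(type_a, type_b, relation):
    """Build one piece of type relationship data with an unordered type pair."""
    return (frozenset((type_a, type_b)), relation)


def rule_matches(first_type_relations, second_type_relations):
    """True when every relationship required by the rule (first set) is also
    present in the relationships derived from the events (second set)."""
    return set(first_type_relations) <= set(second_type_relations)
```

When the check fails, the EMS returns to 2021 and selects another fault handling rule as the target rule.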
步骤203:EMS向OSS发送故障信息,该故障信息包括该至少一个异常事件中的每个异常事件的产生时间和对象信息、该对象关系数据集合、目标故障处理规则中的类型信息集合、根因告警信息和处理建议信息。
至少一个异常事件是在步骤202中得到的第二集合中的异常事件。目标故障处理规则是在步骤202中得到的与该至少一个异常事件匹配的故障处理规则,该对象关系数据集合也是在步骤202中得出的对象关系数据集合。目标故障处理规则中的类型信息集合包括的类型信息是该至少一个异常事件中的每个异常事件的类型信息。
在步骤203中,EMS将该至少一个异常事件中的每个异常事件的产生时间、每个异常事件的对象信息、该对象关系数据集合、目标故障处理规则中的类型信息集合、根因告警信息和处理建议信息组成故障信息,向OSS发送该故障信息。
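For example, in step 202, the EMS determines that the event information of the port alarm event, the link fault alarm event, the remote port blocking alarm event, the transmission rate threshold-crossing event, the port abnormal log event, and the link abnormal log event in the second set is event information of abnormal events caused by the same fault, and obtains the object relationship data set shown in Table 7. In step 203, the EMS then combines the generation time T1 and object information of the port alarm event, the generation time T2 and object information of the link fault alarm event, the generation time T3 and object information of the remote port blocking alarm event, the generation time T4 and object information of the transmission rate threshold-crossing event, the generation time T5 and object information of the port abnormal log event, the generation time T6 and object information of the link abnormal log event, the object relationship data set shown in Table 7, and the type information set, root cause alarm information, and handling suggestion information of the target fault handling rule shown in Table 4 into the fault information shown in Table 8.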
表8
Figure PCTCN2021103583-appb-000005
其中,需要说明的是:对于该故障信息包括的类型信息集合、该至少一个异常事件中的每个异常事件的产生时间和对象信息,由于该类型信息集合包括的类型信息是该每个异常事件的类型信息,所以该类型信息集合、该每个异常事件的产生时间和对象信息共同构成该每个异常事件的事件信息。所以该故障信息实质包括该每个异常事件的事件信息、该对象关系数据集合、该根因告警信息和该处理建议信息。
例如,参见上述表8所示的故障信息,表8中包括端口告警事件的对象信息、产生时间T1和端口告警类型,链路故障告警事件的对象信息、产生时间T2和链路故障告警类型,远端端口阻塞告警事件的对象信息、产生时间T3和远端端口阻塞告警类型,传输速率超限事件的对象信息、产生时间T4和传输速率越限类型,端口异常日志事件的对象信息、产生时间T5和端口异常日志类型,以及链路异常日志事件的对象信息、产生时间T6和链路异常日志类型。所以表8所示的故障信息实质包括端口告警事件的事件信息,链路故障告警事件的事件信息,远端端口阻塞告警事件的事件信息,传输速率超限事件的事件信息,端口异常日志事件的事件信息,链路异常日志事件的事件信息、如表7所示的对象关系数据集合、以及如表4所示的目标故障处理规则中的根因告警信息和处理建议信息。
可选的,对于该至少一个异常事件中的任一个异常事件,在该异常事件的事件信息中除了包括该异常事件的类型信息、产生时间和对象信息外,还可能包括其他信息,该故障信息还可以包括该其他信息。例如,以上述端口告警事件为例,端口告警事件的事件信息中除了包括端口告警事件的类型信息、产生时间和对象信息外,还包括事件标识Event1和告警级别L1,所以表8所示的故障信息还可以包括该事件标识Event1和告警级别L1。
可选的,该故障信息还包括第二类型关系数据集合。
可选的,EMS还可以获取故障发生位置信息、故障开始时间和故障结束时间等中的一个或多个。该故障信息还可能包括故障发生位置信息、故障开始时间和故障结束时间等中的一个或多个。
故障发生位置信息包括产生根因告警事件的网元的位置。可选的,故障发生位置信息可以为该网元的地理位置。例如,可以为该网元的经纬度坐标或该网元所在地的名称等。
故障开始时间可以为根因告警事件的产生时间。另外,在网元中的故障消失后,网元会通知EMS故障消失,所以该故障结束时间是EMS接收故障消失的通知的时间。
在EMS匹配出目标故障处理规则后,EMS还可以根据目标故障处理规则包括的处理建议信息,对网元产生故障进行处理。例如,假设目标故障处理规则为如表4所示的故障处理规则,根据目标故障处理规则的处理建议信息“重启根因告警事件所在端口”,重启第一网元中的端口P1。
其中,EMS可能确定多个故障引起的异常事件的事件信息,根据每个故障引起的异常事件的事件信息生成每个故障对应的故障信息,向OSS发送每个故障对应的故障信息。
其中,与OSS通信的EMS可能为多个,即可能有一个或多个EMS向OSS发送生成的故障信息。
步骤204:OSS接收多个故障信息,在该多个故障信息中,将包括相同类型信息集合、根因告警信息和处理建议信息的故障信息聚合成故障信息集合。
其中,该多个故障信息中的每个故障信息包括的类型信息集合、根因告警信息和处理建议信息可能均相同,所以OSS可能将该多个故障信息聚合成一个故障信息集合。或者,该多个故障信息中的每个故障信息包括的类型信息集合、根因告警信息和处理建议信息可能不均相同,所以OSS可能将该多个故障信息聚合成多个故障信息集合。
其中,OSS得到的每个故障信息集合与故障相对应。
例如,OSS接收的多个故障信息中包括如表8所示的故障信息,对于表8所示的故障信息包括的类型信息集合、根因告警信息和处理建议信息,OSS从接收的该多个故障信息中, 将包括该类型信息集合、该根因告警信息和该处理建议信息的各故障信息聚合成故障信息集合。所以该故障信息集合中的每条故障信息中的类型信息集合均包括端口告警类型、链路故障告警类型、远端端口阻塞告警类型、传输速率越限类型、端口异常日志类型和链路异常日志类型。该故障信息集合中的每条故障信息包括的根因告警信息均是端口告警类型,该故障信息集合中的每条故障信息包括的处理建议信息均是重启根因告警事件所在端口。
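The aggregation in step 204 can be sketched as a grouping by a composite key. The dictionary layout of a piece of fault information (`types`, `root_cause`, `suggestion`) is a hypothetical representation chosen for illustration.

```python
from collections import defaultdict


def aggregate(fault_infos):
    """Group fault information that shares the same type information set,
    root cause alarm information, and handling suggestion information."""
    groups = defaultdict(list)
    for info in fault_infos:
        key = (frozenset(info["types"]), info["root_cause"], info["suggestion"])
        groups[key].append(info)
    return dict(groups)
```

Each resulting group corresponds to one fault information set, and therefore to one kind of fault.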
其中,需要说明的是:OSS在接收到故障信息后,基于故障信息包括对象关系数据集合绘制对象拓扑图,显示该对象拓扑图,该故障信息包括的根因告警信息、处理建议信息、每个异常事件的产生时间、对象信息和类型信息等内容,以供运维人员查看。
可选的,OSS根据该对象拓扑图,该故障信息包括的根因告警信息、处理建议信息、每个异常事件的类型信息、产生时间和/或对象信息等内容,确定EMS做根因定位的过程,和/或,确定需要维修的对象和该对象所在的网元,以提示运维人员对该对象进行维修。
步骤205:对于该故障信息集合中的每条故障信息,OSS基于该故障信息包括的对象关系数据集合和各异常事件的对象信息,获取该故障信息对应的第二类型关系数据集合。
The fault information set is any one of the one or more fault information sets obtained by aggregation in step 204.
在步骤205中,对于该故障信息包括的任意两个异常事件,为了便于说明,将该两个异常事件称为第一异常事件和第二异常事件。如果对象关系数据集合存在包括第一异常事件的对象信息和第二异常事件的对象信息的对象关系数据,将该对象关系数据包括的关系作为第一异常事件的类型信息和第二异常事件的类型信息之间的关系,从而得到一条类型关系数据,该类型关系数据为包括第一异常事件的类型信息、第二异常事件的类型信息和该关系。如果对象关系数据集合不存在包括第一异常事件的对象信息和第二异常事件的对象信息的对象关系数据,得出第一异常事件的类型信息和第二异常事件的类型信息之间没有关系。对该故障信息中的其他任意两个异常事件重复上述过程,从而得到多条不同的类型关系数据,该多条类型关系数据构成该故障信息对应的第二类型关系数据集合。
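For example, take the fault information shown in Table 8. For any two abnormal events included in Table 8, assume that the two abnormal events are the port alarm event and the link fault alarm event; the object where the port alarm event is located is port P1, and the object where the link fault alarm event is located is the first link endpoint. The object relationship data set in Table 8 contains the object relationship data <object information of port P1, object information of the first link endpoint, containment relationship>; that is, it contains object relationship data that includes the object information of the port alarm event and the object information of the link fault alarm event. Therefore, one piece of type relationship data is obtained, which includes the type information of the port alarm event, the type information of the link fault alarm event, and the containment relationship, and can be expressed as <port alarm type, link fault alarm type, containment relationship>. Repeating this process for every other pair of abnormal events in Table 8 finally yields the second type relationship data set shown in Table 6.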
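Step 205 can be sketched as a lookup over every pair of events. The representation is hypothetical: events are `(subtype, object_id)` pairs, and the object relationship data set is a dictionary keyed by an unordered pair of object identifiers. A pair of events yields type relationship data only when a matching object relationship record exists.

```python
from itertools import combinations


def derive_type_relations(events, object_relations):
    """events: list of (subtype, object_id) pairs.
    object_relations: dict mapping frozenset({obj_a, obj_b}) to a relationship."""
    result = set()
    for (type_1, obj_1), (type_2, obj_2) in combinations(events, 2):
        relation = object_relations.get(frozenset((obj_1, obj_2)))
        if relation is not None:
            result.add((frozenset((type_1, type_2)), relation))
    return result
```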
其中,步骤205是一个可选的步骤,对于故障信息集合中的任意一条故障信息,在该故障信息中包括第二类型关系数据集合的情况下,就可以不需要执行步骤205的操作。
步骤206:OSS基于该故障信息包括的各异常事件的产生时间,获取该故障信息对应的时间跨度,该时间跨度为该各异常事件的产生时间的时间跨度。
在步骤206中,OSS从该故障信息中读取最晚产生的异常事件的产生时间和最早产生的异常事件的产生时间,基于读取的两个产生时间确定该故障信息对应的时间跨度。
例如,对于表8所示的故障信息,OSS从该故障信息中读取最晚产生的链路异常日志事件的产生时间T6和最早产生的端口告警事件的产生时间T1,基于产生时间T6和T1确定该故障信息对应的时间跨度为T6-T1。
可选的,上述步骤205和步骤206之间没有先后执行顺序,也就是说,OSS可以先执行步骤205,再执行步骤206;或者,OSS同时执行步骤205和步骤206;或者,OSS先执行步骤206,再执行步骤205。
步骤207:OSS对该故障信息集合中的故障信息对应的第二类型关系数据集合取交集得到第三类型关系数据集合,以及从该故障信息集合中的故障信息对应的时间跨度中选择最大时间跨度。
在步骤207中,OSS可以通过如下2071至2073的操作来实现,该2071至2073的操作分别为:
2071:OSS将该故障信息集合划分为多个子集合,每个子集合包括的故障信息对应的第二类型关系数据集合和时间跨度均相同。
在2071中OSS可以按第二类型关系数据集合和时间跨度,对该故障信息集合中的故障信息进行归类,将对应相同的第二类型关系数据集合和时间跨度归类为一类并组成一个子集合。
2072:OSS从该划分的子集合中选择满足第一指定条件的多个子集合,第一指定条件是该多个子集合包括的故障信息总数目与该故障信息集合包括的故障信息总数目之间的比例超过指定比例。
在2072中选择子集合的过程可以为:OSS从该划分的子集合中选择包括故障信息数目最多的m个子集合,m为大于或等于2的整数。计算该m个子集合包括的故障信息总数目与该故障信息集合包括的故障信息总数目之间的比例,在该比例超过指定比例,执行如下2073的操作。
在该比例未超过指定比例,OSS从剩余未选择的子集合中选择包括的故障信息数目最多的n个子集合,n为大于或等于1的整数,此时总共选择m+n个子集合。计算该m+n个子集合包括的故障信息总数目与该故障信息集合包括的故障信息总数目之间的比例,在该比例超过指定比例,执行如下2073的操作。在该比例未超过指定比例,从剩余未选择的子集合中选择包括的故障信息数目最多的n个子集合,此时总共选择m+2n个子集合,再执行上述计算比例和选择子集合的操作,直至选择的子集合包括的故障信息总数目与故障信息集合包括的故障信息总数目之间的比例超过指定比例为止。
该指定比例可以为0.9、0.8或0.7等。
2073:OSS对选择的子集合中的每个故障信息对应的第二类型关系数据集合取交集得到第三类型关系数据集合,以及从选择的子集合中的每个故障信息对应的时间跨度中选择最大时间跨度。
其中,该最大时间跨度是该故障信息集合对应的故障引起的异常事件的持续时长。
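Operations 2071 to 2073 can be sketched as follows. Each piece of fault information is reduced to a hypothetical `(type_relations, span)` pair, and the defaults for `ratio`, `m`, and `n` are illustrative values within the ranges the text allows.

```python
from collections import defaultdict


def merge_fault_info_set(infos, ratio=0.9, m=2, n=1):
    """infos: non-empty list of (type_relations: frozenset, span: float).
    Returns (third type relationship data set, maximum time span)."""
    if not infos:
        raise ValueError("fault information set must not be empty")
    # 2071: identical (relations, span) pairs form one subset.
    subsets = defaultdict(list)
    for info in infos:
        subsets[info].append(info)
    ranked = sorted(subsets.values(), key=len, reverse=True)
    # 2072: take the m largest subsets, then n more at a time, until the
    # selected share of all fault information exceeds the specified ratio.
    take = m
    while take < len(ranked) and sum(map(len, ranked[:take])) / len(infos) <= ratio:
        take += n
    chosen = [subset[0] for subset in ranked[:take]]
    # 2073: intersect the relationship sets and keep the largest span.
    relations = frozenset.intersection(*(rels for rels, _ in chosen))
    return relations, max(span for _, span in chosen)
```

Selecting only the most common subsets keeps rare outliers, such as a single report with an unusually long span, from shrinking the intersection or inflating the time length.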
步骤208:OSS向至少一个EMS发送故障处理规则,该故障处理规则包括第三类型关系数据集合、该最大时间跨度、第一故障信息包括的类型信息集合、根因告警信息和处理建议信息,第一故障信息是该故障信息集合中的任意一条故障信息。
在步骤208中,OSS将第三类型关系数据集合、该最大时间跨度、第一故障信息包括的 类型信息集合、根因告警信息和处理建议信息组成故障处理规则,向与该OSS通信的至少一个EMS发送该故障处理规则。
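Each piece of type relationship data in the third type relationship data set includes two pieces of type information. The OSS determines whether the type information included in the third type relationship data set covers all the type information in the type information set of the first fault information. If it does, the OSS combines the third type relationship data set, the maximum time span, and the type information set, root cause alarm information, and handling suggestion information of the first fault information into the fault handling rule. If it does not, the OSS discards the third type relationship data set, the maximum time span, and the fault information set.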
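The coverage check that decides whether a rule is emitted or discarded can be sketched as follows, assuming (hypothetically) that each piece of type relationship data is stored as an unordered type pair plus a relationship label.

```python
def covers_all_types(type_relations, type_set):
    """True when every type in type_set appears in at least one piece of
    type relationship data; otherwise the candidate rule is discarded."""
    mentioned = set()
    for type_pair, _relation in type_relations:
        mentioned |= type_pair
    return set(type_set) <= mentioned
```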
在步骤204中可能聚合得到多个故障信息集合,OSS对每个故障信息集合执行上述步骤205至208的操作,可能得到一个或多个故障处理规则,然后向与该OSS通信的至少一个EMS发送该一个或多个故障处理规则。
OSS向与该OSS通信的EMS发送故障处理规则后,EMS接收该故障处理规则,并保存该故障处理规则。之后,还可以使用该故障处理规则匹配由同一故障引起的异常事件的事件信息,和/或,处理网元出现的故障,即EMS继续从上述步骤201开始执行。
可选的,EMS保存该故障处理规则的操作,可以为:对于该故障处理规则中的类型信息集合、根因告警信息和处理建议信息,在EMS本地保存有包括该类型信息集合、该根因告警信息和该处理建议信息的故障处理规则时,将EMS本地保存的包括该类型信息集合、该根因告警信息和该处理建议信息的故障处理规则更新为该故障处理规则。在EMS本地没有保存包括该类型信息集合、该根因告警信息和该处理建议信息的故障处理规则时,直接保存该故障处理规则。
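The save-or-update behavior on the EMS side can be sketched as a dictionary keyed by the three fields that identify a rule. The in-memory `store` and the field names are assumptions made for illustration; a real EMS would presumably persist its rules.

```python
def save_rule(store, rule):
    """Insert the rule, or overwrite the stored rule that has the same type
    information set, root cause alarm information, and handling suggestion.
    Returns True when an existing rule was updated."""
    key = (frozenset(rule["types"]), rule["root_cause"], rule["suggestion"])
    updated = key in store
    store[key] = rule
    return updated
```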
在本申请实施例中,EMS接收多个异常事件的事件信息,使用EMS中包括的故障处理规则识别由同一故障引起的每个异常事件的事件信息,基于该每个异常事件的事件信息,得到故障信息,该故障信息包括由该每个异常事件的类型信息、每个异常事件的产生时间、每个异常事件的对象信息、对象关系数据集合、该故障处理规则中的根因告警信息和处理建议信息。由于该故障信息包括每个异常事件的类型信息、对象信息和对象关系数据集合,从而OSS可以根据每个异常事件的类型信息、对象信息和对象关系数据集合可以得出不同异常事件的类型信息之间的关系,进而可以生成故障处理规则,将该故障处理规则发送给不同的EMS。由于不同EMS均可接收用于处理故障的故障处理规则,从而不同的EMS可以对该故障进行处理。由于OSS接收不同EMS发送的多个故障信息,可以基于大量故障信息自动生成故障处理规则,使得故障处理规则具有可学习性和更新性,提高了生成故障处理规则的精度,以及提高了生成故障处理规则的效率。OSS将生成的故障处理规则发送给不同的EMS,这样可以提高在EMS配置故障处理规则的效率和实时性,也保证不同的EMS可以基于该故障处理规则,处理与该故障处理规则相对应的故障。
参见图3,本申请实施例提供了一种处理故障的方法,该方法可以应用图1所示的网络架构,在该方法中,EMS生成故障信息,该故障信息包括类型信息集合、第二类型关系数据集合、根因告警信息、处理建议信息和时间跨度,OSS接收EMS发送的故障信息,并根据接收的故障信息生成故障处理规则。该方法包括:
步骤301-302:分别与步骤201-202相同,在此不再详细说明。
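After step 302, the EMS obtains the event information of at least one abnormal event caused by the same fault, the second type relationship data set, and the target fault handling rule that matches the at least one abnormal event.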
例如,仍以EMS接收到上述表1、表2和表3所示的端口告警事件、链路故障告警事件、远端端口阻塞告警事件、传输速率越限事件、端口异常日志事件和链路异常日志事件为例,EMS对该六个异常事件经过上述步骤301至步骤302的处理后,得出该六个异常事件是由同一故障引起的异常事件,以及得出如表4所示的与该六个异常事件匹配的目标故障处理规则和如表6所示的第二类型关系数据集合。
步骤303:EMS向OSS发送故障信息,该故障信息包括时间跨度、第二类型关系数据集合、目标故障处理规则中的类型信息集合、根因告警信息和处理建议信息,该时间跨度是该至少一个异常事件中的每个异常事件的产生时间的时间跨度。
在步骤303中,EMS基于该至少一个异常事件中的每个异常事件的产生时间获取时间跨度,将该时间跨度、第二类型关系数据集合、目标故障处理规则中的类型信息集合、根因告警信息和处理建议信息组成故障信息,向OSS发送该故障信息。
例如,EMS基于端口告警事件的产生时间T1、链路故障告警事件的产生时间T2、远端端口阻塞告警事件的产生时间T3、传输速率越限事件的产生时间T4、端口异常日志事件的产生时间T5和链路异常日志事件的产生时间T6,获取时间跨度为T6-T1。将该时间跨度T6-T1、如表6所示的第二类型关系数据集合、如表4所示的目标故障处理规则中的类型信息集合、根因告警信息和处理建议信息组成如下表9所示的故障信息。
表9
Figure PCTCN2021103583-appb-000006
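Because the type information set included in the fault information has a smaller data volume than the event information of the at least one abnormal event, the fault information in this step is small, which reduces the network resources it occupies.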
其中,EMS可能确定多个故障引起的异常事件的事件信息,根据每个故障引起的异常事件的事件信息生成每个故障对应的故障信息,向OSS发送每个故障对应的故障信息。
其中,与OSS通信的EMS可能为多个,即可能有一个或多个EMS向OSS发送生成的故障信息。
步骤304:OSS接收多个故障信息,在该多个故障信息中,将包括相同类型信息集合、 根因告警信息和处理建议信息的故障信息聚合成第一故障信息集合。
OSS聚合得到第一故障信息集合的详细实现过程,可以参见图2所示实施例中的步骤204中的相关内容,在此不再详细说明。
其中,需要说明的是:OSS接收的多个故障信息中可能存在部分故障信息完全相同。
步骤305:OSS对第一故障信息集合中的故障信息包括的第二类型关系数据集合取交集得到第三类型关系数据集合,以及从该故障信息集合中的故障信息包括的时间跨度中选择最大时间跨度。
在步骤305中,OSS可以通过如下3051至3053的操作来实现,该3051至3053的操作分别为:
3051:对于第一故障信息集合中的任意相同的多条故障信息,OSS统计该多条故障信息的信息条数,在第一故障信息集合中进行去重处理得到第二故障信息集合,使得在第二故障信息集合中保留的每条故障信息不同。
其中,去重后得到的第二故障信息集合中的每条故障信息对应一个信息条数。
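The deduplication with counts in operation 3051 can be sketched with a counter. This assumes each piece of fault information is represented as a hashable value, for example a tuple of frozensets, strings, and a time span; that representation is hypothetical.

```python
from collections import Counter


def deduplicate(first_fault_info_set):
    """Return a mapping from each distinct piece of fault information to the
    number of identical copies found in the first fault information set."""
    return dict(Counter(first_fault_info_set))
```

The keys of the result form the second fault information set, and the values are the per-entry counts used in operation 3052.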
3052:OSS从去重后的故障信息集合中选择满足第二指定条件的多个故障信息,第二指定条件是该多个故障信息对应的信息条数累加值与第二故障信息集合包括的每条故障信息对应的信息条数的总累加值之间的比例超过指定比例。
在3052中选择多个故障信息的过程可以为:OSS从第二故障信息集合中选择对应信息条数最多的m个故障信息,m为大于或等于2的整数。计算该m个故障信息对应的信息条数的累加值与第二故障信息集合包括的每条故障信息对应的信息条数的总累加值之间的比例,在该比例超过指定比例,执行如下3053的操作。
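If the ratio does not exceed the specified ratio, the OSS selects, from the fault information not yet selected in the second fault information set, the n pieces of fault information with the largest counts, where n is an integer greater than or equal to 1; at this point, m+n pieces of fault information have been selected in total. The OSS calculates the ratio between the accumulated count of the m+n pieces of fault information and the total accumulated count. If the ratio exceeds the specified ratio, the OSS performs operation 3053 below. If the ratio does not exceed the specified ratio, the OSS selects another n pieces of fault information with the largest counts from the fault information not yet selected in the second fault information set, so that m+2n pieces have been selected in total, and repeats the ratio calculation and selection until the selected pieces of fault information satisfy the second specified condition.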
3053:OSS对选择的每个故障信息包括的第二类型关系数据集合取交集得到第三类型关系数据集合,以及从选择的每个故障信息包括的时间跨度中选择最大时间跨度。
其中,该最大时间跨度是该故障信息集合对应的故障引起的异常事件的持续时长。
步骤306:OSS向至少一个EMS发送故障处理规则,该故障处理规则包括第三类型关系数据集合、该最大时间跨度、第一故障信息包括的类型信息集合、根因告警信息和处理建议信息,第一故障信息是第二故障信息集合中的任意一条故障信息。
在步骤306中,OSS将第三类型关系数据集合、该最大时间跨度、第一故障信息包括的类型信息集合、根因告警信息和处理建议信息组成故障处理规则,向与该OSS通信的至少一个EMS发送该故障处理规则。
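Each piece of type relationship data in the third type relationship data set includes two pieces of type information. The OSS determines whether the type information included in the third type relationship data set covers all the type information in the type information set of the first fault information. If it does, the OSS combines the third type relationship data set, the maximum time span, and the type information set, root cause alarm information, and handling suggestion information of the first fault information into the fault handling rule. If it does not, the OSS discards the third type relationship data set, the maximum time span, and the fault information set.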
在步骤304中可能聚合得到多个故障信息集合,OSS对每个故障信息集合执行上述步骤305的操作,可能得到一个或多个故障处理规则,然后向与该OSS通信的至少一个EMS发送该一个或多个故障处理规则。
OSS向与该OSS通信的EMS发送故障处理规则后,EMS接收该故障处理规则,并保存该故障处理规则。之后,还可以使用该故障处理规则匹配由同一故障引起的异常事件的事件信息,和/或,处理网元出现的故障,即EMS继续从上述步骤301开始执行。
可选的,EMS保存该故障处理规则的操作,可以为:对于该故障处理规则中的类型信息集合、根因告警信息和处理建议信息,在EMS本地保存有包括该类型信息集合、该根因告警信息和该处理建议信息的故障处理规则时,将EMS本地保存的包括该类型信息集合、该根因告警信息和该处理建议信息的故障处理规则更新为该故障处理规则。在EMS本地没有保存包括该类型信息集合、该根因告警信息和该处理建议信息的故障处理规则时,直接保存该故障处理规则。
在本申请实施例中,EMS接收多个异常事件的事件信息,使用EMS中包括的故障处理规则识别由同一故障引起的异常事件的事件信息,根据该故障引起的异常事件的事件信息,生成故障信息,该故障信息包括类型信息集合、第二类型关系数据集合、根因告警信息和处理建议信息。由于该故障信息包括第二类型关系数据集合,OSS可以根据故障信息中的第二类型关系数据集合生成故障处理规则,将该故障处理规则发送给不同的EMS。由于不同EMS均可接收用于处理故障的故障处理规则,从而不同的EMS可以对该故障进行处理。由于OSS接收不同EMS发送的多个故障信息,可以基于大量故障信息自动生成故障处理规则,使得故障处理规则具有可学习性和更新性,提高了生成故障处理规则的精度,以及提高了生成故障处理规则的效率。OSS将生成的故障处理规则发送给不同的EMS,这样可以提高在EMS中配置故障处理规则的效率和实时性,也保证不同的EMS可以基于该故障处理规则,处理与该故障处理规则相对应的故障。
参见图4,本申请实施例提供了一种处理故障的装置400,所述装置400可以部署在上述图1、图2或图3所示实施例提供的OSS中,包括:
接收单元401,用于接收至少一个网元管理系统EMS发送的故障信息,该故障信息包括类型信息集合、第一关系集合、故障的根因告警信息、针对该故障的处理建议信息和由该故障引起的多个异常事件的时间信息,该类型信息集合包括该多个异常事件的类型信息,第一关系集合包括该多个异常事件之间的关系;
处理单元402,用于将包括相同类型信息集合、根因告警信息和处理建议信息的故障信息聚合成故障信息集合,将该故障信息集合中的每条故障信息中的第一关系集合取交集得到第二关系集合,以及根据该每条故障信息中的时间信息确定时间长度,该时间长度是产生该故障引起的异常事件的持续时长;
发送单元403,用于向至少一个EMS发送故障处理规则,该故障处理规则包括第二关系集合、该时间长度、第一故障信息中的类型信息集合、根因告警信息和处理建议信息,第一故障信息是该故障信息集合中的任意一条故障信息,该故障处理规则用于指示该至少一个 EMS处理该故障。
可选的,处理单元402取交集以及确定时间长度的详细实现过程,可以参见图2所示实施例中的步骤204至步骤207以及图3所示实施例的步骤304和步骤305中的相关内容,在此不再详细说明。
可选的,该多个异常事件之间的关系使用至少一条类型关系数据表示,每条类型关系数据包括该多个异常事件中的两个异常事件的类型信息和该两个异常事件的类型信息之间的关系,该两个异常事件所在对象之间存在该关系。
可选的,该故障信息还包括该多个异常事件中的每个异常事件所在对象的对象信息;该多个异常事件之间的关系使用至少一条对象关系数据表示,每条对象关系数据包括存在关系的两个对象的对象信息和该关系,每个异常事件所在的对象包括该两个对象;
处理单元402,还用于:
根据该至少一条对象关系数据、每个异常事件的对象信息和类型信息,获取该故障信息对应的类型关系数据集合,该类型关系数据集合包括至少一条类型关系数据,每条类型关系数据包括该多个异常事件中的两个异常事件的类型信息和该两个异常事件的类型信息之间的关系,该两个异常事件所在对象之间存在该关系;
将该故障信息集合中的每条故障信息对应的类型关系数据集合取交集得到第二关系集合。
可选的,处理单元402获取该故障信息对应的类型关系数据集合的详细实现过程,可以参见图2所示实施例中的步骤205中的相关内容,在此不再详细说明。
可选的,该多个异常事件的时间信息是该多个异常事件的产生时间的时间跨度;
处理单元402,用于从该每条故障信息中的时间跨度中选择最大时间跨度作为该时间长度。
可选的,该多个异常事件的时间信息是该多个异常事件中的每个异常事件的产生时间;
处理单元402,用于根据每条故障信息包括的时间信息,分别获取每条故障信息对应的异常事件的产生时间的时间跨度;从获取的时间跨度中选择最大时间跨度作为该时间长度。
可选的,该多个异常事件的类型包括告警类型、性能越限类型和/或网元异常日志类型。
可选的,处理单元402,还用于:
基于该至少一条对象关系数据生成对象拓扑图,显示该对象拓扑图。
在本申请实施例中,由于接收单元接收的故障信息包括第一关系集合和多个异常事件的时间信息,这样处理单元可以基于故障信息中的第一关系集合和时间信息生成故障处理规则。由于接收单元可以接收不同EMS发送的故障信息,处理单元基于各EMS的故障信息生成故障处理规则,丰富了生成故障处理所需的数据,不仅使得故障处理规则具有可学习性和更新性,还提高生成故障处理规则的效率,发送单元向至少一个EMS发送该故障处理规则,从而提高在EMS中配置故障处理规则的效率和实时性。
参见图5,本申请实施例提供了一种处理故障的装置500,所述装置500可以部署在上述图1、图2或图3所示实施例提供的EMS中,包括:
处理单元501,用于获取所述装置500所管理的至少一个网元上报的由第一故障引起的多个第一异常事件信息,以及获取第一故障的根因告警信息和处理建议信息,第一异常事件信息包括第一异常事件的类型信息、产生时间和所在对象的对象信息;
处理单元501,还用于基于每个第一异常事件的对象信息获取第一关系集合,第一关系集合包括该多个第一异常事件之间的关系;
发送单元502,用于向运营支撑系统OSS发送第一故障信息,第一故障信息包括类型信息集合、第一时间信息、第一关系集合、该根因告警信息和该处理建议信息,该类型信息集合包括每个第一异常事件的类型信息,第一时间信息包括该多个第一异常事件的产生时间或该多个第一异常事件的产生时间的时间跨度,第一故障信息用于OSS生成第一故障处理规则,第一故障处理规则用于指示接收第一故障处理规则的至少一个网元管理系统EMS处理第一故障;
其中,第一故障处理规则包括第二关系集合、第一时间长度、故障信息集合中的任一条第一故障信息中的类型信息集合、根因告警信息和处理建议信息,第一故障信息集合包括OSS接收的多条第一故障信息且该多条第一故障信息中的每条第一故障信息包括相同的类型信息集合、根因告警信息和处理建议信息,第二关系集合是OSS接收的该多条第一故障信息中的第一关系集合的交集,第一时间长度是OSS基于该多条第一故障信息中的第一时间信息得到的,第一时间长度是产生第一故障引起的第一异常事件的持续时长。
可选的,处理单元501获取由第一故障引起的多个第一异常事件信息的详细实现过程,可以参见图2所示实施例中的步骤202的相关内容,在此不再详细说明。
可选的,处理单元501获取第一关系集合的详细实现过程,可以参见图2所示实施例中的步骤2024的相关内容,在此不再详细说明。
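Optionally, the relationships among the multiple first abnormal events are represented by at least one piece of type relationship data. Each piece of type relationship data includes the type information of two of the multiple first abnormal events and the relationship between the type information of those two first abnormal events, where that relationship exists between the objects on which the two first abnormal events occur. The apparatus 500 includes a network topology graph;
the processing unit 501 is configured to obtain the first relationship set based on the network topology graph and/or the object information and type information of each first abnormal event.
Optionally, for the detailed implementation by which the processing unit 501 obtains the at least one piece of type relationship data, refer to the related content in step 2024 of the embodiment shown in FIG. 2; details are not repeated here.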
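Optionally, the relationships among the multiple first abnormal events are represented by at least one piece of object relationship data. Each piece of object relationship data includes the object information of two related objects and the relationship between them, and the objects on which the multiple first abnormal events occur include those two objects. The apparatus 500 includes a network topology graph;
the processing unit 501 is configured to obtain the first relationship set based on the network topology graph and/or the object information of each first abnormal event.
Optionally, for the detailed implementation by which the processing unit 501 obtains the at least one piece of object relationship data, refer to the related content in step 2024 of the embodiment shown in FIG. 2; details are not repeated here.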
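A simple reading of deriving object relationship data from the network topology graph is to keep the topology edges whose two endpoints both carry an abnormal event. The sketch below takes that reading as an assumption; step 2024 is not reproduced in this section and may differ, for example by also following indirect paths. The edge layout and function name are hypothetical.

```python
from typing import List, Set, Tuple

ObjectRelation = Tuple[str, str, str]  # hypothetical layout: (object_a, relationship, object_b)


def object_relations_from_topology(
    topology_edges: List[ObjectRelation],
    event_objects: Set[str],
) -> List[ObjectRelation]:
    """Keep only the topology edges that directly link two objects with abnormal events."""
    return [
        (obj_a, relationship, obj_b)
        for obj_a, relationship, obj_b in topology_edges
        if obj_a in event_objects and obj_b in event_objects
    ]
```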
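Optionally, the first fault information further includes the object information of each first abnormal event.
Optionally, the types of the multiple first abnormal events include an alarm type, a performance threshold-crossing type, and/or a network element exception log type.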
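Optionally, the apparatus 500 further includes a receiving unit 503;
the receiving unit 503 is configured to receive the first fault handling rule;
the processing unit 501 is further configured to obtain, based on the first fault handling rule, multiple pieces of second abnormal event information that are caused by a second fault and reported by the at least one network element, where the first fault and the second fault are faults of the same type, and each piece of second abnormal event information includes the type information of a second abnormal event, its generation time, and the object information of the object on which it occurs; and to obtain a third relationship set based on the object information of each second abnormal event, where the third relationship set includes the relationships among the multiple second abnormal events;
the sending unit 502 is further configured to send second fault information to the OSS, where the second fault information includes second time information, the third relationship set, the type information set, the root-cause alarm information, and the handling suggestion information; the second time information includes the generation times of the multiple second abnormal events or the time span of those generation times; the second fault information is used to trigger the OSS to generate a second fault handling rule; the second fault handling rule includes a fourth relationship set, a second time length, the type information set, the root-cause alarm information, and the handling suggestion information; the fourth relationship set is the intersection of the third relationship sets in multiple pieces of second fault information received by the OSS, each of which includes the type information set, the root-cause alarm information, and the handling suggestion information; the second time length is obtained by the OSS from the second time information in those pieces of second fault information; and the second time length is the duration over which the second abnormal events caused by the second fault are generated;
the receiving unit 503 is further configured to receive the second fault handling rule;
the processing unit 501 is further configured to update the first fault handling rule to the second fault handling rule.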
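The update step can be sketched as a keyed store in which a newer rule for the same fault replaces the older one. Keying on the type information set, root-cause alarm, and handling suggestion is an assumption taken from the fact that both rules share those three fields; the class and the dictionary keys are hypothetical.

```python
from typing import Dict, FrozenSet, Optional, Tuple

RuleKey = Tuple[FrozenSet[str], str, str]


class RuleStore:
    """Holds at most one fault handling rule per (type set, root cause, suggestion)."""

    def __init__(self) -> None:
        self._rules: Dict[RuleKey, dict] = {}

    @staticmethod
    def _key(rule: dict) -> RuleKey:
        return (frozenset(rule["type_set"]), rule["root_cause_alarm"], rule["handling_suggestion"])

    def install(self, rule: dict) -> None:
        """Store the rule, replacing any earlier rule for the same fault."""
        self._rules[self._key(rule)] = rule

    def lookup(self, type_set, root_cause_alarm: str, handling_suggestion: str) -> Optional[dict]:
        return self._rules.get((frozenset(type_set), root_cause_alarm, handling_suggestion))

    def __len__(self) -> int:
        return len(self._rules)
```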
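In this embodiment of the application, the processing unit obtains the first relationship set based on the object information of each first abnormal event, and the first relationship set includes the relationships among the multiple first abnormal events, so the fault information that the sending unit sends to the OSS includes the first relationship set and the time information of the multiple abnormal events. The OSS can therefore intersect the first relationship sets in the pieces of first fault information in the first fault information set to obtain the second relationship set, and determine the first time length from the time information in those pieces of first fault information, which yields a fault handling rule that includes the second relationship set, the first time length, and the other content. Because the OSS can receive first fault information sent by different EMSs and generate the fault handling rule from the first fault information of those EMSs, more data is available for generating fault handling rules. This makes the generated rules learnable and updatable and also makes rule generation more efficient. The OSS sends the fault handling rule to the EMS, which improves the efficiency and timeliness of configuring fault handling rules in the EMS.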
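Referring to FIG. 6, an embodiment of this application provides a schematic diagram of a fault handling apparatus 600. The apparatus 600 may be the OSS in any of the foregoing embodiments. The apparatus 600 includes at least one processor 601, an internal connection 602, a memory 603, and at least one transceiver 604.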
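The apparatus 600 is an apparatus with a hardware structure and can be used to implement the functional modules of the apparatus 400 shown in FIG. 4. For example, a person skilled in the art will appreciate that the processing unit 402 of the apparatus 400 shown in FIG. 4 can be implemented by the at least one processor 601 invoking code in the memory 603, and that the receiving unit 401 and the sending unit 403 of the apparatus 400 shown in FIG. 4 can be implemented by the transceiver 604.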
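Optionally, the apparatus 600 can also be used to implement the functions of the OSS in any of the foregoing embodiments.
Optionally, the processor 601 may be a general-purpose central processing unit (CPU), a network processor (NP), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling execution of the programs of the solutions of this application.
The internal connection 602 may include a path for transferring information between the foregoing components. Optionally, the internal connection 602 is a board, a bus, or the like.
The transceiver 604 is configured to communicate with other devices or with a communication network.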
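The memory 603 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions; a random access memory (RAM) or another type of dynamic storage device that can store information and instructions; an electrically erasable programmable read-only memory (EEPROM); a compact disc read-only memory (CD-ROM) or other compact disc storage; optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, and the like); a magnetic disk storage medium or another magnetic storage device; or any other medium that can carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these. The memory may exist independently and be connected to the processor through a bus. The memory may also be integrated with the processor.
The memory 603 is configured to store the application program code for executing the solutions of this application, and execution is controlled by the processor 601. The processor 601 is configured to execute the application program code stored in the memory 603 and, in cooperation with the at least one transceiver 604, to cause the apparatus 600 to implement the functions of the methods of this patent application.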
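In a specific implementation, as an embodiment, the processor 601 may include one or more CPUs, for example CPU0 and CPU1 in FIG. 6.
In a specific implementation, as an embodiment, the apparatus 600 may include multiple processors, for example the processor 601 and the processor 607 in FIG. 6. Each of these processors may be a single-core (single-CPU) processor or a multi-core (multi-CPU) processor. A processor here may refer to one or more devices, circuits, and/or processing cores for processing data (for example, computer program instructions).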
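Referring to FIG. 7, an embodiment of this application provides a schematic diagram of a fault handling apparatus 700. The apparatus 700 may be the EMS in any of the foregoing embodiments. The apparatus 700 includes at least one processor 701, an internal connection 702, a memory 703, and at least one transceiver 704.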
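The apparatus 700 is an apparatus with a hardware structure and can be used to implement the functional modules of the apparatus 500 shown in FIG. 5. For example, a person skilled in the art will appreciate that the processing unit 501 of the apparatus 500 shown in FIG. 5 can be implemented by the at least one processor 701 invoking code in the memory 703, and that the sending unit 502 and the receiving unit 503 of the apparatus 500 shown in FIG. 5 can be implemented by the transceiver 704.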
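Optionally, the apparatus 700 can also be used to implement the functions of the EMS in any of the foregoing embodiments.
Optionally, the processor 701 may be a general-purpose central processing unit (CPU), a network processor (NP), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling execution of the programs of the solutions of this application.
The internal connection 702 may include a path for transferring information between the foregoing components. Optionally, the internal connection 702 is a board, a bus, or the like.
The transceiver 704 is configured to communicate with other devices or with a communication network.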
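The memory 703 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions; a random access memory (RAM) or another type of dynamic storage device that can store information and instructions; an electrically erasable programmable read-only memory (EEPROM); a compact disc read-only memory (CD-ROM) or other compact disc storage; optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, and the like); a magnetic disk storage medium or another magnetic storage device; or any other medium that can carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these. The memory may exist independently and be connected to the processor through a bus. The memory may also be integrated with the processor.
The memory 703 is configured to store the application program code for executing the solutions of this application, and execution is controlled by the processor 701. The processor 701 is configured to execute the application program code stored in the memory 703 and, in cooperation with the at least one transceiver 704, to cause the apparatus 700 to implement the functions of the methods of this patent application.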
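In a specific implementation, as an embodiment, the processor 701 may include one or more CPUs, for example CPU0 and CPU1 in FIG. 7.
In a specific implementation, as an embodiment, the apparatus 700 may include multiple processors, for example the processor 701 and the processor 707 in FIG. 7. Each of these processors may be a single-core (single-CPU) processor or a multi-core (multi-CPU) processor. A processor here may refer to one or more devices, circuits, and/or processing cores for processing data (for example, computer program instructions).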
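Referring to FIG. 8, an embodiment of this application provides a fault handling system 800. The system 800 includes the apparatus 400 shown in FIG. 4 and the apparatus 500 shown in FIG. 5; alternatively, the system 800 includes the apparatus 600 shown in FIG. 6 and the apparatus 700 shown in FIG. 7.
The apparatus 400 shown in FIG. 4 or the apparatus 600 shown in FIG. 6 may be the OSS 801, and the apparatus 500 shown in FIG. 5 or the apparatus 700 shown in FIG. 7 may be the EMS 802.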
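A person of ordinary skill in the art will understand that all or some of the steps of the foregoing embodiments may be carried out by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing describes only optional embodiments of this application and is not intended to limit this application. Any modification, equivalent replacement, or improvement made within the principles of this application shall fall within the scope of protection of this application.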

Claims (27)

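  1. A method for handling a fault, characterized in that the method comprises:
    receiving, by an operations support system (OSS), fault information sent by at least one element management system (EMS), wherein the fault information comprises a type information set, a first relationship set, root-cause alarm information of a fault, handling suggestion information for the fault, and time information of multiple abnormal events caused by the fault, the type information set comprises type information of the multiple abnormal events, and the first relationship set comprises relationships among the multiple abnormal events;
    aggregating, by the OSS, fault information that comprises the same type information set, root-cause alarm information, and handling suggestion information into a fault information set, intersecting the first relationship sets in all pieces of fault information in the fault information set to obtain a second relationship set, and determining a time length based on the time information in each piece of fault information, wherein the time length is the duration over which the abnormal events caused by the fault are generated;
    sending, by the OSS, a fault handling rule to the at least one EMS, wherein the fault handling rule comprises the second relationship set, the time length, and the type information set, root-cause alarm information, and handling suggestion information in first fault information, the first fault information is any one piece of fault information in the fault information set, and the fault handling rule instructs the at least one EMS to handle the fault.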
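  2. The method according to claim 1, characterized in that the relationships among the multiple abnormal events are represented by at least one piece of type relationship data, each piece of type relationship data comprises type information of two of the multiple abnormal events and a relationship between the type information of the two abnormal events, and the relationship exists between the objects on which the two abnormal events occur.
  3. The method according to claim 1, characterized in that the fault information further comprises object information of the object on which each of the multiple abnormal events occurs; the relationships among the multiple abnormal events are represented by at least one piece of object relationship data, each piece of object relationship data comprises object information of two related objects and the relationship, and the objects on which the abnormal events occur comprise the two objects; and the method further comprises:
    obtaining, by the OSS based on the at least one piece of object relationship data and on the object information and type information of each abnormal event, a type relationship data set corresponding to the fault information, wherein the type relationship data set comprises at least one piece of type relationship data, each piece of type relationship data comprises type information of two of the multiple abnormal events and a relationship between the type information of the two abnormal events, and the relationship exists between the objects on which the two abnormal events occur;
    wherein the intersecting, by the OSS, of the first relationship sets in all pieces of fault information in the fault information set to obtain a second relationship set comprises:
    intersecting, by the OSS, the type relationship data sets corresponding to all pieces of fault information in the fault information set to obtain the second relationship set.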
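  4. The method according to any one of claims 1 to 3, characterized in that the time information of the multiple abnormal events is a time span of generation times of the multiple abnormal events;
    wherein the determining, by the OSS, of a time length based on the time information in each piece of fault information comprises:
    selecting, by the OSS, the largest of the time spans in the pieces of fault information as the time length.
  5. The method according to any one of claims 1 to 3, characterized in that the time information of the multiple abnormal events is a generation time of each of the multiple abnormal events;
    wherein the determining, by the OSS, of a time length based on the time information in each piece of fault information comprises:
    obtaining, by the OSS based on the time information in each piece of fault information, the time span of the generation times of the abnormal events corresponding to that piece of fault information;
    selecting, by the OSS, the largest of the obtained time spans as the time length.
  6. The method according to any one of claims 1 to 5, characterized in that the types of the multiple abnormal events comprise an alarm type, a performance threshold-crossing type, and/or a network element exception log type.
  7. The method according to claim 3, characterized in that the method further comprises:
    generating, by the OSS, an object topology graph based on the at least one piece of object relationship data, and displaying the object topology graph.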
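  8. A method for handling a fault, characterized in that the method comprises:
    obtaining, by an element management system (EMS), multiple pieces of first abnormal event information that are caused by a first fault and reported by at least one network element managed by the EMS, and obtaining root-cause alarm information and handling suggestion information for the first fault, wherein each piece of first abnormal event information comprises type information of the first abnormal event, a generation time, and object information of the object on which the first abnormal event occurs;
    obtaining, by the EMS, a first relationship set based on the object information of each first abnormal event, wherein the first relationship set comprises relationships among the multiple first abnormal events;
    sending, by the EMS, first fault information to an operations support system (OSS), wherein the first fault information comprises a type information set, first time information, the first relationship set, the root-cause alarm information, and the handling suggestion information, the type information set comprises the type information of each first abnormal event, the first time information comprises generation times of the multiple first abnormal events or a time span of the generation times of the multiple first abnormal events, the first fault information is used by the OSS to generate a first fault handling rule, and the first fault handling rule instructs at least one EMS that receives the first fault handling rule to handle the first fault;
    wherein the first fault handling rule comprises a second relationship set, a first time length, and the type information set, root-cause alarm information, and handling suggestion information in any one piece of first fault information in a fault information set; the first fault information set comprises multiple pieces of first fault information received by the OSS, and each of the multiple pieces of first fault information comprises the same type information set, root-cause alarm information, and handling suggestion information; the second relationship set is the intersection of the first relationship sets in the multiple pieces of first fault information received by the OSS; the first time length is obtained by the OSS based on the first time information in the multiple pieces of first fault information; and the first time length is the duration over which the first abnormal events caused by the first fault are generated.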
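  9. The method according to claim 8, characterized in that the relationships among the multiple first abnormal events are represented by at least one piece of type relationship data, each piece of type relationship data comprises type information of two of the multiple first abnormal events and a relationship between the type information of the two first abnormal events, the relationship exists between the objects on which the two first abnormal events occur, and the EMS comprises a network topology graph;
    wherein the obtaining, by the EMS, of a first relationship set based on the object information of each first abnormal event comprises:
    obtaining, by the EMS, the first relationship set based on the network topology graph and/or the object information and type information of each first abnormal event.
  10. The method according to claim 8, characterized in that the relationships among the multiple first abnormal events are represented by at least one piece of object relationship data, each piece of object relationship data comprises object information of two related objects and the relationship, the objects on which the multiple first abnormal events occur comprise the two objects, and the EMS comprises a network topology graph;
    wherein the obtaining, by the EMS, of a first relationship set based on the object information of each first abnormal event comprises:
    obtaining, by the EMS, the first relationship set based on the network topology graph and/or the object information of each first abnormal event.
  11. The method according to claim 10, characterized in that the first fault information further comprises the object information of each first abnormal event.
  12. The method according to any one of claims 8 to 11, characterized in that the types of the multiple first abnormal events comprise an alarm type, a performance threshold-crossing type, and/or a network element exception log type.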
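  13. The method according to any one of claims 8 to 12, characterized in that the method further comprises:
    receiving, by the EMS, the first fault handling rule, and obtaining, based on the first fault handling rule, multiple pieces of second abnormal event information that are caused by a second fault and reported by the at least one network element, wherein the first fault and the second fault are faults of the same type, and each piece of second abnormal event information comprises type information of the second abnormal event, a generation time, and object information of the object on which the second abnormal event occurs;
    obtaining, by the EMS, a third relationship set based on the object information of each second abnormal event, wherein the third relationship set comprises relationships among the multiple second abnormal events;
    sending, by the EMS, second fault information to the OSS, wherein the second fault information comprises second time information, the third relationship set, the type information set, the root-cause alarm information, and the handling suggestion information; the second time information comprises generation times of the multiple second abnormal events or a time span of the generation times of the multiple second abnormal events; the second fault information is used by the OSS to generate a second fault handling rule; the second fault handling rule comprises a fourth relationship set, a second time length, the type information set, the root-cause alarm information, and the handling suggestion information; the fourth relationship set is the intersection of the third relationship sets in multiple pieces of second fault information received by the OSS; each of the multiple pieces of second fault information comprises the type information set, the root-cause alarm information, and the handling suggestion information; the second time length is obtained by the OSS based on the second time information in the multiple pieces of second fault information; and the second time length is the duration over which the second abnormal events caused by the second fault are generated;
    receiving, by the EMS, the second fault handling rule, and updating the first fault handling rule to the second fault handling rule.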
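  14. An apparatus for handling a fault, characterized in that the apparatus comprises:
    a receiving unit, configured to receive fault information sent by at least one element management system (EMS), wherein the fault information comprises a type information set, a first relationship set, root-cause alarm information of a fault, handling suggestion information for the fault, and time information of multiple abnormal events caused by the fault, the type information set comprises type information of the multiple abnormal events, and the first relationship set comprises relationships among the multiple abnormal events;
    a processing unit, configured to aggregate fault information that comprises the same type information set, root-cause alarm information, and handling suggestion information into a fault information set, to intersect the first relationship sets in all pieces of fault information in the fault information set to obtain a second relationship set, and to determine a time length based on the time information in each piece of fault information, wherein the time length is the duration over which the abnormal events caused by the fault are generated;
    a sending unit, configured to send a fault handling rule to the at least one EMS, wherein the fault handling rule comprises the second relationship set, the time length, and the type information set, root-cause alarm information, and handling suggestion information in first fault information, the first fault information is any one piece of fault information in the fault information set, and the fault handling rule instructs the at least one EMS to handle the fault.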
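  15. The apparatus according to claim 14, characterized in that the relationships among the multiple abnormal events are represented by at least one piece of type relationship data, each piece of type relationship data comprises type information of two of the multiple abnormal events and a relationship between the type information of the two abnormal events, and the relationship exists between the objects on which the two abnormal events occur.
  16. The apparatus according to claim 14, characterized in that the fault information further comprises object information of the object on which each of the multiple abnormal events occurs; the relationships among the multiple abnormal events are represented by at least one piece of object relationship data, each piece of object relationship data comprises object information of two related objects and the relationship, and the objects on which the abnormal events occur comprise the two objects; and the processing unit is further configured to:
    obtain, based on the at least one piece of object relationship data and on the object information and type information of each abnormal event, a type relationship data set corresponding to the fault information, wherein the type relationship data set comprises at least one piece of type relationship data, each piece of type relationship data comprises type information of two of the multiple abnormal events and a relationship between the type information of the two abnormal events, and the relationship exists between the objects on which the two abnormal events occur;
    intersect the type relationship data sets corresponding to all pieces of fault information in the fault information set to obtain the second relationship set.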
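  17. The apparatus according to any one of claims 14 to 16, characterized in that the time information of the multiple abnormal events is a time span of generation times of the multiple abnormal events;
    the processing unit is configured to select the largest of the time spans in the pieces of fault information as the time length.
  18. The apparatus according to any one of claims 14 to 16, characterized in that the time information of the multiple abnormal events is a generation time of each of the multiple abnormal events;
    the processing unit is configured to obtain, based on the time information in each piece of fault information, the time span of the generation times of the abnormal events corresponding to that piece of fault information, and to select the largest of the obtained time spans as the time length.
  19. The apparatus according to any one of claims 14 to 18, characterized in that the types of the multiple abnormal events comprise an alarm type, a performance threshold-crossing type, and/or a network element exception log type.
  20. The apparatus according to claim 16, characterized in that the processing unit is further configured to:
    generate an object topology graph based on the at least one piece of object relationship data, and display the object topology graph.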
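  21. An apparatus for handling a fault, characterized in that the apparatus comprises:
    a processing unit, configured to obtain multiple pieces of first abnormal event information that are caused by a first fault and reported by at least one network element managed by the apparatus, and to obtain root-cause alarm information and handling suggestion information for the first fault, wherein each piece of first abnormal event information comprises type information of the first abnormal event, a generation time, and object information of the object on which the first abnormal event occurs;
    the processing unit is further configured to obtain a first relationship set based on the object information of each first abnormal event, wherein the first relationship set comprises relationships among the multiple first abnormal events;
    a sending unit, configured to send first fault information to an operations support system (OSS), wherein the first fault information comprises a type information set, first time information, the first relationship set, the root-cause alarm information, and the handling suggestion information, the type information set comprises the type information of each first abnormal event, the first time information comprises generation times of the multiple first abnormal events or a time span of the generation times of the multiple first abnormal events, the first fault information is used by the OSS to generate a first fault handling rule, and the first fault handling rule instructs at least one element management system (EMS) that receives the first fault handling rule to handle the first fault;
    wherein the first fault handling rule comprises a second relationship set, a first time length, and the type information set, root-cause alarm information, and handling suggestion information in any one piece of first fault information in a fault information set; the first fault information set comprises multiple pieces of first fault information received by the OSS, and each of the multiple pieces of first fault information comprises the same type information set, root-cause alarm information, and handling suggestion information; the second relationship set is the intersection of the first relationship sets in the multiple pieces of first fault information received by the OSS; the first time length is obtained by the OSS based on the first time information in the multiple pieces of first fault information; and the first time length is the duration over which the first abnormal events caused by the first fault are generated.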
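  22. The apparatus according to claim 21, characterized in that the relationships among the multiple first abnormal events are represented by at least one piece of type relationship data, each piece of type relationship data comprises type information of two of the multiple first abnormal events and a relationship between the type information of the two first abnormal events, the relationship exists between the objects on which the two first abnormal events occur, and the apparatus comprises a network topology graph;
    the processing unit is configured to obtain the first relationship set based on the network topology graph and/or the object information and type information of each first abnormal event.
  23. The apparatus according to claim 21, characterized in that the relationships among the multiple first abnormal events are represented by at least one piece of object relationship data, each piece of object relationship data comprises object information of two related objects and the relationship, the objects on which the multiple first abnormal events occur comprise the two objects, and the apparatus comprises a network topology graph;
    the processing unit is configured to obtain the first relationship set based on the network topology graph and/or the object information of each first abnormal event.
  24. The apparatus according to claim 23, characterized in that the first fault information further comprises the object information of each first abnormal event.
  25. The apparatus according to any one of claims 21 to 24, characterized in that the types of the multiple first abnormal events comprise an alarm type, a performance threshold-crossing type, and/or a network element exception log type.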
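  26. The apparatus according to any one of claims 21 to 25, characterized in that the apparatus further comprises a receiving unit;
    the receiving unit is configured to receive the first fault handling rule;
    the processing unit is further configured to obtain, based on the first fault handling rule, multiple pieces of second abnormal event information that are caused by a second fault and reported by the at least one network element, wherein the first fault and the second fault are faults of the same type, and each piece of second abnormal event information comprises type information of the second abnormal event, a generation time, and object information of the object on which the second abnormal event occurs; and to obtain a third relationship set based on the object information of each second abnormal event, wherein the third relationship set comprises relationships among the multiple second abnormal events;
    the sending unit is further configured to send second fault information to the OSS, wherein the second fault information comprises second time information, the third relationship set, the type information set, the root-cause alarm information, and the handling suggestion information; the second time information comprises generation times of the multiple second abnormal events or a time span of the generation times of the multiple second abnormal events; the second fault information is used by the OSS to generate a second fault handling rule; the second fault handling rule comprises a fourth relationship set, a second time length, the type information set, the root-cause alarm information, and the handling suggestion information; the fourth relationship set is the intersection of the third relationship sets in multiple pieces of second fault information received by the OSS; each of the multiple pieces of second fault information comprises the type information set, the root-cause alarm information, and the handling suggestion information; the second time length is obtained by the OSS based on the second time information in the multiple pieces of second fault information; and the second time length is the duration over which the second abnormal events caused by the second fault are generated;
    the receiving unit is further configured to receive the second fault handling rule;
    the processing unit is further configured to update the first fault handling rule to the second fault handling rule.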
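  27. A system for handling a fault, characterized in that the system comprises the apparatus according to any one of claims 14 to 20 and the apparatus according to any one of claims 21 to 26.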
PCT/CN2021/103583 2020-11-16 2021-06-30 处理故障的方法、装置及系统 WO2022100108A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP21890639.4A EP4181475A4 (en) 2020-11-16 2021-06-30 METHOD, DEVICE AND SYSTEM FOR ERROR PROCESSING

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011281081.9 2020-11-16
CN202011281081.9A CN114584452A (zh) 2020-11-16 2020-11-16 处理故障的方法、装置及系统

Publications (1)

Publication Number Publication Date
WO2022100108A1 true WO2022100108A1 (zh) 2022-05-19

Family

ID=81600727

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/103583 WO2022100108A1 (zh) 2020-11-16 2021-06-30 处理故障的方法、装置及系统

Country Status (3)

Country Link
EP (1) EP4181475A4 (zh)
CN (1) CN114584452A (zh)
WO (1) WO2022100108A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116032725A (zh) * 2022-12-27 2023-04-28 中国联合网络通信集团有限公司 故障根因定位模型的生成方法及装置

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117290133A (zh) * 2022-06-16 2023-12-26 中兴通讯股份有限公司 异常事件处理方法、电子设备及存储介质
CN115913895B (zh) * 2022-11-08 2024-10-15 苏州浪潮智能科技有限公司 一种服务器故障诊断告警的方法、装置、设备及介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102006198A (zh) * 2010-12-16 2011-04-06 中国电子科技集团公司第三十研究所 一种网络故障关联规则获取方法及装置
WO2019084226A1 (en) * 2017-10-26 2019-05-02 Cisco Technology, Inc. SYSTEM AND METHOD FOR HYBRID AND ELASTIC SERVICES
CN109769261A (zh) * 2019-03-25 2019-05-17 新华三技术有限公司 一种网络故障处理方法及装置
WO2019161936A1 (en) * 2018-02-26 2019-08-29 Telefonaktiebolaget Lm Ericsson (Publ) Network slicing with smart contracts
CN111106944A (zh) * 2018-10-26 2020-05-05 中国移动通信有限公司研究院 一种故障告警信息处理方法及设备

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4181475A4 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116032725A (zh) * 2022-12-27 2023-04-28 中国联合网络通信集团有限公司 故障根因定位模型的生成方法及装置
CN116032725B (zh) * 2022-12-27 2024-06-11 中国联合网络通信集团有限公司 故障根因定位模型的生成方法及装置

Also Published As

Publication number Publication date
EP4181475A1 (en) 2023-05-17
CN114584452A (zh) 2022-06-03
EP4181475A4 (en) 2023-12-20

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2021890639

Country of ref document: EP

Effective date: 20230208

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21890639

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE