WO2015051638A1 - Fault location method and device - Google Patents

Publication number
WO2015051638A1
Authority
WO
WIPO (PCT)
Prior art keywords
alarm
fault
probability
cause
occurrence
Prior art date
Application number
PCT/CN2014/076867
Other languages
French (fr)
Chinese (zh)
Inventor
杨凡
何诚
钱剑锋
Original Assignee
华为技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2015051638A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0677 Localisation of faults

Definitions

  • The present invention relates to the field of network technologies, and in particular to a fault location method and apparatus.
  • Background
  • During operation, a network device (network element) generates different alarm information according to its own running status and changes in its environment.
  • The alarm information includes an alarm reference document and an alarm fault cause, and the device sends it to the network management system. Network management and operation and maintenance personnel can then check the possible fault causes against the alarm reference document in the alarm information, thereby locating the cause of the network fault.
  • However, because the network management system collects a large amount of alarm information, including many prompt alarms and derivative alarms, network management and operation and maintenance personnel locate faults from alarm information less efficiently.
  • One existing method removes derivative alarms by analyzing the correlation between alarms (associated alarm analysis), that is, by applying specific logical relationships to the alarm field information.
  • Another method uses a probabilistic diagnosis model to calculate the probability of a fault occurring under an alarm condition: the prior probability of the fault is used to determine whether the alarm will occur within a preset time window, a probability threshold is set, and then, according to whether the alarm occurs within the preset time window, the cause of the fault is determined and the probability of the fault occurring under the alarm condition is calculated.
  • Embodiments of the present invention provide a fault location method and apparatus to solve the technical problem of inaccurate fault location and low location efficiency in the prior art.
  • The embodiments of the present invention disclose the following technical solutions:
  • A first aspect provides a fault location method, the method comprising:
  • selecting the fault causes whose probability of occurrence falls within a preset range as the fault causes of the alarm set;
  • hierarchically merging and classifying the fault causes of the alarm set to achieve hierarchical positioning of the fault cause.
  • Combining and classifying the alarm fault causes according to the fault location target to obtain the alarm fault cause set includes:
  • combining and classifying the alarm fault causes to obtain an alarm fault cause set.
  • Combining and classifying the alarm fault causes according to the fault location target further yields an alarm name set; the method further includes:
  • using the degree of association to verify the validity of the merged classification.
  • Establishing the correspondence between the alarm set and the alarm fault cause set includes:
  • Determining the probability of occurrence of the fault cause corresponding to the alarm set in the predetermined time window includes:
  • Calculating the probability of occurrence of each type of alarm in the alarm set in the predetermined time window includes:
  • using the frequency of occurrence of the alarm within the predetermined time window as the probability of occurrence of the alarm within that time window.
  • Calculating the probability of occurrence of the fault cause corresponding to each type of alarm includes:
  • initializing the probability of occurrence of the fault cause of each type of alarm;
  • Hierarchically merging and classifying the fault causes of the alarm set to implement hierarchical positioning of the fault cause includes:
  • A second aspect provides a fault location device, including:
  • an extracting unit, configured to extract alarm information of each device in the network;
  • a first establishing unit configured to establish a correspondence between an alarm name and an alarm failure cause in the alarm information
  • a processing unit configured to combine and classify the alarm fault causes according to the fault location target, and obtain a set of alarm fault causes
  • an acquiring unit configured to acquire an alarm set of alarm data in the live network
  • a second establishing unit configured to establish a correspondence between the alarm set and the alarm failure cause set; wherein the alarm set includes: a set of alarm names;
  • a determining unit configured to determine a probability of occurrence of a fault cause corresponding to the alarm set in a predetermined time window
  • a selecting unit configured to select a probability that the fault cause that meets the preset range occurs as a fault cause of the alarm set
  • the locating unit is configured to hierarchically classify the fault causes of the alarm set, and implement hierarchical positioning of the fault cause.
  • the processing unit includes:
  • a setting unit, configured to set the fault location target according to the device fault location principle;
  • the classification unit is configured to combine and classify the alarm fault causes according to the fault location target, and obtain an alarm fault cause set.
  • The classification unit is further configured to obtain an alarm name set after combining and classifying the alarm fault causes according to the fault location target; the device further includes:
  • a calculating unit, configured to calculate the degree of association of the alarm name set after the merged classification, where the degree of association is used to verify the validity of the merged classification.
  • The first establishing unit is specifically configured to establish, according to the correspondence between the alarm name and the alarm fault cause, a bipartite graph of the alarm set and the alarm fault cause set.
  • the determining unit includes:
  • a first probability calculation unit configured to calculate a probability of occurrence of each type of alarm in the alarm set in the predetermined time window
  • a second probability calculation unit configured to calculate a probability of occurrence of a fault cause corresponding to each type of alarm
  • the probability determining unit is configured to determine, according to the probability of occurrence of each type of alarm and the probability of occurrence of each corresponding fault cause, a probability of occurrence of a fault cause corresponding to the alarm set in the predetermined time window.
  • the first probability calculation unit is specifically configured to use a frequency of occurrence of an alarm within a predetermined time window as a probability of occurrence of an alarm within a preset time window.
  • The second probability calculation unit includes:
  • an initialization unit configured to initialize a probability of occurrence of the fault cause of each type of alarm
  • a verification unit configured to calculate and verify a probability of occurrence of a failure cause of each type of alarm within the predetermined time window
  • the positioning unit includes:
  • a hierarchical classification unit, configured to hierarchically merge and classify the fault causes of the alarm set to obtain the merge reason of each layer;
  • a hierarchical positioning unit, configured to calculate the fault cause step by step to complete the hierarchical positioning of the fault location target.
  • A third aspect provides a fault location device, including:
  • an alarm information extracting unit, configured to extract the alarm information of each device in the network and establish a correspondence between the alarm name and the alarm fault cause in the alarm information;
  • an alarm information processing unit, configured to combine and classify the alarm fault causes according to the fault location target to obtain an alarm fault cause set;
  • an alarm data processing unit, configured to: acquire an alarm set of alarm data and establish a correspondence between the alarm set and the alarm fault cause set, where the alarm set includes an alarm name set; determine the probability of occurrence of the fault causes corresponding to the alarm set in a predetermined time window; and select the fault causes whose probability of occurrence falls within the preset range as the fault causes of the alarm set;
  • a fault hierarchy locating unit, configured to hierarchically merge and classify the fault causes of the alarm set to achieve hierarchical positioning of the fault cause.
  • The alarm information processing unit is specifically configured to: set a fault location target according to the device fault location principle; and combine and classify the alarm fault causes according to the fault location target to obtain an alarm fault cause set.
  • After combining and classifying the alarm fault causes according to the fault location target, the alarm information processing unit also obtains a merged alarm name set, and is further configured to calculate the degree of association of the alarm name set after the merged classification; the degree of association is used to verify the validity of the merged classification.
  • The alarm data processing unit establishing the correspondence between the alarm set and the alarm fault cause set includes: establishing a bipartite graph of the alarm set and the alarm fault cause set according to the correspondence between the alarm name and the alarm fault cause.
  • The alarm data processing unit determining the probability of occurrence of the fault cause corresponding to the alarm set in the preset time window includes: calculating the probability of occurrence of each type of alarm in the alarm set in the preset time window; calculating the probability of occurrence of the fault cause corresponding to each type of alarm; determining, according to the probability of occurrence of each type of alarm and the probability of occurrence of each corresponding fault cause, the probability of occurrence of the fault cause corresponding to the alarm set in the preset time window; and selecting the fault causes whose probability of occurrence falls within the preset range as the fault causes of the alarm set.
  • The alarm data processing unit calculating the probability of occurrence of each type of alarm in the alarm set in the preset time window includes: using the frequency of occurrence of the alarm within the preset time window as the probability of occurrence of the alarm within that time window.
  • The alarm data processing unit calculating the probability of occurrence of the fault cause corresponding to each type of alarm includes: initializing the probability of occurrence of the fault cause of each type of alarm; calculating and verifying the probability of occurrence of the fault cause of each type of alarm within the predetermined time window; and updating the probability of occurrence of the fault cause corresponding to each type of alarm.
  • The fault hierarchy locating unit is specifically configured to hierarchically merge and classify the fault causes of the alarm set to obtain the merge reason of each layer, and to calculate the fault cause step by step to complete the hierarchical positioning of the fault location target.
  • In the embodiments of the present invention, the alarm information is first extracted from the alarm design and description documents, the alarm fault causes are then combined and classified according to the fault location target, and the probability of occurrence of the fault causes of the alarm set (or alarm sequence) in the live network within the time window is calculated, thereby achieving hierarchical positioning of the fault cause and improving fault location efficiency.
  • Further, by associating the alarm fault causes, the correlation of alarms is also improved.
  • In addition, because the probability of occurrence of the alarm fault causes is updated in real time, the accuracy of alarm fault location is improved.
  • FIG. 1 is a flowchart of a method for locating a fault according to an embodiment of the present invention.
  • FIG. 2 is a schematic diagram of a bipartite graph of an alarm set and a fault cause set according to an embodiment of the present invention;
  • FIG. 3 is a diagram showing an example of using the frequency of occurrence of an alarm in place of the probability of occurrence of the alarm according to an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of a probability of occurrence of an update failure according to an embodiment of the present invention.
  • FIG. 5 is a schematic diagram of fault hierarchy positioning according to an embodiment of the present invention.
  • FIG. 6 is a schematic diagram of an embodiment of the present invention.
  • FIG. 7 is a schematic structural diagram of a fault locating device according to an embodiment of the present invention.
  • FIG. 8 is another schematic structural diagram of a fault locating device according to an embodiment of the present invention;
  • FIG. 9 is a schematic structural diagram of a server according to an embodiment of the present invention.
  • Detailed description
  • the following embodiments of the present invention provide a fault location method and apparatus.
  • the causes of failures are combined and classified, and the probability of occurrence of the faults after the merger and classification is calculated, thereby effectively improving the efficiency and accuracy of the fault location.
  • The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort fall within the scope of the present invention.
  • FIG. 1 is a flowchart of a method for locating a fault according to an embodiment of the present invention.
  • In this embodiment, the target network of the fault location contains alarm design or description files, and each alarm described there includes an alarm name and the cause of the alarm.
  • the method includes:
  • Step 101 Obtain the alarm information of each device in the network, and establish a correspondence between the alarm name and the alarm failure cause in the alarm information.
  • The server can extract the alarm information from the alarm design and description (or reference) files of each device in the network. The alarm information includes an alarm name, an alarm fault cause, and an alarm ID (identifier), but is not limited thereto.
  • The alarm information of a device usually includes an explanation of the alarm and a reference document that explains and describes the alarm.
  • The alarm information includes an alarm name and an alarm fault cause (the cause of the alarm), and may also include an ID (number) and an alarm level (indicating the severity of the alarm).
  • The alarm name, the alarm ID, and the alarm fault cause are extracted, and the correspondence between the alarm name and the alarm fault cause is established.
  • Table 1 gives an example of this correspondence, which is not limited to it.
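The correspondence built in Step 101 can be pictured as a lookup from alarm name to the fault causes listed for it. The following is a minimal sketch only: the record layout, the alarm names and the causes are hypothetical illustrations and are not taken from Table 1.

```python
from collections import defaultdict

# Hypothetical records, as they might be extracted from alarm description files.
ALARM_RECORDS = [
    {"id": "A001", "name": "LINK_DOWN", "causes": ["cable unplugged", "port disabled"]},
    {"id": "A002", "name": "FAN_FAULT", "causes": ["fan blocked by dust"]},
    {"id": "A001", "name": "LINK_DOWN", "causes": ["optical module faulty"]},
]


def build_cause_index(records):
    """Return {alarm name: set of fault causes} from the extracted records."""
    index = defaultdict(set)
    for record in records:
        index[record["name"]].update(record["causes"])
    return dict(index)


cause_index = build_cause_index(ALARM_RECORDS)
```

Using a set per alarm name means that causes repeated across several documents are stored once.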
  • Step 102 Combine and classify the alarm fault causes according to the fault location target to obtain an alarm fault cause set, where the alarm fault cause set is the set obtained by combining and classifying the various alarm fault causes in Table 1.
  • the server first sets the fault location target according to the device fault location principle. That is, since the alarm information is generally reported to the network management platform, the alarm information includes fault information of the entire network.
  • the fault location target is the level that needs to be located.
  • the fault location target can be set to one device, or one module of the device, or the software, hardware, and configuration problems of the device module.
  • the design principles of the fault location target can be related to the composition of the network and the module design of the device.
  • the alarm fault causes are combined and classified according to the fault location target, and an alarm fault cause set is obtained. That is to say, after the fault location target is designed, the alarm fault causes can be combined and classified according to the fault location target. For example, it can be divided according to the module of the device. For example, a device may contain 3 modules, and the fault location target is to locate the module, and all the fault causes in the module can be combined and classified.
  • The fault causes are combined and classified by using a natural language processing (NLP) method.
  • Table 2 shows an example of combining the fault causes:
  • An alarm fault cause (that is, an initial fault cause) may be merged into a "merge reason (1)" or into a "merge reason (2)". The "merge reason (2)" is coarser than the "merge reason (1)".
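The two merge levels can be sketched as a mapping from each initial fault cause to a finer and a coarser class. In this sketch a hand-written lookup table stands in for the NLP-based classification the text mentions, and every cause and class name is hypothetical.

```python
# Hypothetical merge table: initial cause -> (merge reason (1), merge reason (2)).
MERGE_TABLE = {
    "fan blocked by dust": ("fan fault", "hardware fault"),
    "fan power cable loose": ("fan fault", "hardware fault"),
    "board overheated": ("board fault", "hardware fault"),
    "threshold set too low": ("threshold configuration", "configuration fault"),
}


def merge_causes(initial_causes, level):
    """Map initial causes to merged classes; level 1 is finer, level 2 is coarser."""
    return {MERGE_TABLE[cause][level - 1] for cause in initial_causes}


fine = merge_causes(MERGE_TABLE, 1)
coarse = merge_causes(MERGE_TABLE, 2)
```

The coarser level yields fewer classes than the finer one, which is what makes it suitable for a first, high-level positioning pass.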
  • After the alarm fault causes are combined and classified, an alarm name set is also obtained. The degree of association of the alarm name set (also referred to as the alarm set) after the merged classification is then calculated, and this degree of association is used to verify the validity of the merged classification.
  • the alarm name set is a combination of the various types of alarm names in Table 1.
  • The corresponding alarm ID set may also be obtained, where the alarm ID set is the set obtained by classifying and combining the various IDs in Table 1.
  • That is to say, for the alarm set $\{A_1, A_2, \ldots, A_n\}$, if the degree of association of the alarm set before the merged classification is $I_0$ and the degree of association of the alarm set after the merged classification is $I_1$, then the merge is a valid merge when $I_1 \ge I_0$.
  • Alarm similarity and the degree of association of the alarm set are defined as follows.
  • The alarm similarity refers to the degree of similarity between two alarms A and B, where:
  • $U(A \cup B)$ is the union of the alarm fault cause sets $U(A)$ and $U(B)$;
  • $U(A \cap B)$ is the intersection of the alarm fault cause sets $U(A)$ and $U(B)$ of alarms A and B.
  • The degree of association of the alarm set refers to the degree to which the alarms in the alarm set are similar.
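The text defines alarm similarity in terms of the union and intersection of two alarms' fault cause sets, but the formula itself does not survive here. The sketch below assumes the similarity is the ratio of intersection size to union size (the Jaccard index) and assumes the degree of association is the mean pairwise similarity; both are illustrative choices, not the patent's stated definitions.

```python
from itertools import combinations


def alarm_similarity(causes_a, causes_b):
    """Assumed similarity: |U(A) & U(B)| / |U(A) | U(B)|, or 0.0 if both are empty."""
    union = causes_a | causes_b
    if not union:
        return 0.0
    return len(causes_a & causes_b) / len(union)


def association_degree(cause_sets):
    """Assumed degree of association: mean similarity over all pairs of alarms."""
    pairs = list(combinations(cause_sets, 2))
    if not pairs:
        return 0.0
    return sum(alarm_similarity(a, b) for a, b in pairs) / len(pairs)


# Hypothetical cause sets of two alarms before and after merging the causes.
before = [{"fan blocked", "fan broken"}, {"fan broken", "sensor drift"}]
after = [{"fan fault"}, {"fan fault", "sensor fault"}]
merge_is_valid = association_degree(after) >= association_degree(before)
```

Under these assumptions, a merge that makes alarms share more causes raises the degree of association, so comparing the values before and after gives a simple validity check.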
  • Step 103 Obtain an alarm set of alarm data in the current network, and establish a correspondence between the alarm set and the alarm failure cause set; wherein the alarm set includes: a set of alarm names;
  • The alarm data refers to alarms obtained from the live network, where each alarm includes an alarm name, the time when the alarm occurred, and the alarm frequency.
  • The alarm set includes the various alarm names, the times when the various types of alarms occurred, the frequency of each type of alarm, and so on.
  • To establish the correspondence between the alarm set and the alarm fault cause set, the server may establish a bipartite graph of the alarm set and the alarm fault cause set according to the correspondence between the alarm name and the alarm fault cause.
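A bipartite graph between alarms and fault causes can be held as two adjacency maps, one per side. This is a minimal sketch under that assumption; the function name and the sample data are hypothetical.

```python
def build_bipartite(alarm_set, cause_index):
    """Return (alarm -> causes, cause -> alarms) adjacency maps for the alarm set."""
    alarm_to_causes = {alarm: set(cause_index.get(alarm, ())) for alarm in alarm_set}
    cause_to_alarms = {}
    for alarm, causes in alarm_to_causes.items():
        for cause in causes:
            cause_to_alarms.setdefault(cause, set()).add(alarm)
    return alarm_to_causes, cause_to_alarms


# Hypothetical correspondence between alarm names and merged fault causes.
index = {
    "LINK_DOWN": {"port fault", "cable fault"},
    "HIGH_BER": {"cable fault"},
}
a2c, c2a = build_bipartite({"LINK_DOWN", "HIGH_BER"}, index)
```

Keeping both directions makes it cheap to ask which alarms a candidate fault cause would explain.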
  • Step 104 determining a probability of occurrence of a fault cause corresponding to the alarm set in a predetermined time window;
  • the probability of a cause of failure under a known set of alarms (which may also be referred to as an alarm sequence) is calculated.
  • the probability of occurrence of each failure cause can be calculated using a Bayesian network, a Markov chain, or the like.
  • FIG. 3 is a schematic diagram of using the frequency of occurrence of an alarm in place of the probability of occurrence of the alarm in this embodiment.
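Using frequency in place of probability can be sketched as counting each alarm type inside the time window and dividing by the number of alarms in that window. The event layout (timestamp, alarm name) and the choice of relative frequency as the normalisation are assumptions made for illustration.

```python
from collections import Counter


def alarm_probabilities(events, window_start, window_end):
    """Relative frequency of each alarm name with window_start <= timestamp < window_end."""
    counts = Counter(
        name for timestamp, name in events if window_start <= timestamp < window_end
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: count / total for name, count in counts.items()}


# Hypothetical alarm events as (timestamp in seconds, alarm name).
events = [(1, "LINK_DOWN"), (2, "LINK_DOWN"), (3, "HIGH_BER"), (12, "HIGH_BER")]
probabilities = alarm_probabilities(events, 0, 10)
```

The event at timestamp 12 falls outside the window and does not contribute to the estimate.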
  • Optionally, the method may further include: updating the probability of occurrence of the fault cause corresponding to each type of alarm.
  • FIG. 4 is a schematic diagram of updating the probability of occurrence of a fault cause.
  • The cause of an alarm is analyzed from the alarm data of the live network to determine which fault cause generated the alarm. After the fault is verified, the fault cause of the alarm is calculated in the next step.
  • If it is determined within a time window that alarm $alarm_i$ was generated by fault cause $fault_j$, the probability of the alarm fault cause is updated as follows:
  • The terms of the update are: the probability of occurrence of the fault cause of each type of alarm; the probability that fault cause j generates alarm i; the probability that alarm i is generated by a cause other than fault cause j; and a constant C.
  • The probability of occurrence of the fault cause corresponding to the alarm set in the time window is then determined according to the probability of occurrence of each type of alarm and the probability of occurrence of each corresponding fault cause.
  • $P(fault_j \mid \langle alarm_1, alarm_2, \ldots, alarm_n \rangle) = \prod_{fault_j \in alarm_i} P_{alarm_i} \, P_{alarm_i, j}$
  • where $P_{alarm_i}$ is the probability of occurrence of alarm $alarm_i$ in the time window, obtained in the same way as the probability of occurrence of each alarm in the alarm set described above, that is, the frequency of occurrence of the alarm is used in place of its probability of occurrence;
  • $fault_j \in alarm_i$ indicates that $fault_j$ is one of the causes of $alarm_i$;
  • $P_{alarm_i, j}$ indicates the probability of $fault_j$ in alarm $alarm_i$.
  • Its calculation is the same as the process, detailed above, of calculating the probability of the fault cause corresponding to each type of alarm.
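The combination step can be sketched as follows: for each fault cause, take every alarm in the window that lists it as a cause, and combine the alarm's probability of occurrence with the probability of that cause within the alarm. The formula is damaged in this text, so the use of a product as the combination rule is an assumption, as are all names and numbers below.

```python
from math import prod


def fault_cause_scores(alarm_prob, cause_prob):
    """Score each fault cause over the alarms that list it.

    alarm_prob: {alarm: probability of the alarm in the window}
    cause_prob: {alarm: {fault cause: probability of that cause for the alarm}}
    Assumed rule: product of alarm_prob[a] * cause_prob[a][f] over alarms a listing f.
    """
    faults = {f for alarm in alarm_prob for f in cause_prob.get(alarm, {})}
    return {
        f: prod(
            alarm_prob[a] * cause_prob[a][f]
            for a in alarm_prob
            if f in cause_prob.get(a, {})
        )
        for f in faults
    }


# Hypothetical inputs for two alarms observed in one window.
alarm_prob = {"LINK_DOWN": 0.5, "HIGH_BER": 0.5}
cause_prob = {
    "LINK_DOWN": {"cable fault": 1.0},
    "HIGH_BER": {"cable fault": 0.5, "optical module fault": 0.5},
}
scores = fault_cause_scores(alarm_prob, cause_prob)
```

If the intended rule is a sum rather than a product, replacing `prod` with `sum` is the only change needed; the two rules rank causes differently, so the choice matters.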
  • Step 105 Select the fault causes whose probability of occurrence falls within the preset range as the fault causes of the alarm set.
  • The server selects the fault causes whose probability falls within the preset range (for example, the fault cause with the highest probability, the three fault causes with the highest probabilities, or the fault causes whose probability lies within a certain interval) as the fault causes of the alarm set or alarm sequence.
  • The preset range is set dynamically according to requirements; for example, it may be at least one of the 10 fault causes with the highest probability, but is not limited thereto.
  • For example, the top N fault causes with the highest probability (such as the top one, five, eight or 10; the value of N may need to be selected adaptively) are selected as the set of fault causes (or sequence of fault causes) of the alarm set in the time window.
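The selection in Step 105 can be sketched as ranking the fault causes by probability and keeping either the top N, those inside a probability interval, or both. The function name, parameters and sample scores are hypothetical.

```python
def select_causes(scores, top_n=None, interval=None):
    """Rank causes by probability, then apply an optional interval and top-N cut."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if interval is not None:
        low, high = interval
        ranked = [(cause, p) for cause, p in ranked if low <= p <= high]
    if top_n is not None:
        ranked = ranked[:top_n]
    return ranked


# Hypothetical fault cause probabilities for one time window.
scores = {"cable fault": 0.5, "port fault": 0.25, "board fault": 0.125}
top_two = select_causes(scores, top_n=2)
in_band = select_causes(scores, interval=(0.2, 0.6))
```

Leaving both parameters optional lets the same function cover the single best cause, a top-N list, or an interval, which matches the dynamically set preset range described above.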
  • Step 106 Hierarchically merge and classify the fault causes of the alarm set to implement hierarchical positioning of the fault cause.
  • The fault causes are merged and classified to obtain the coarsest category of fault causes, and the fault cause at that level is calculated; the fault cause is then calculated level by level until the fault location target is reached.
  • The fault hierarchy positioning diagram is shown in FIG. 5.
  • Given the alarm set (or alarm sequence) $\langle alarm_1, alarm_2, \ldots, alarm_n \rangle$ in the time window, the merged and classified fault causes are calculated and the fault cause of that classification is determined; the alarm set or alarm sequence is then used to calculate the upper-level fault causes that were merged into that fault cause, and so on until the fault is located. If the fault location target is the original cause, the calculation continues until the original cause is located.
  • At each level, the alarm set or alarm sequence is used to calculate the probability of the fault causes under that level's fault cause. The probability of a merged fault cause is the probability of the fault causes under it; for example, an alarm may have three fault causes, each with a probability of 1/3, and if they are merged into one class, its probability is 1.
  • In the example, the calculation shows that "monitoring device configuration" has the highest probability at this level. If the initial cause needs to be located, the alarm sequence is used to continue calculating the upper layer by the above method, which determines that "monitoring device is plugged in" is the fault cause of the alarm.
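The level-by-level positioning can be sketched as a walk down a tree of merged causes: at each level the child with the highest probability is chosen, where a merged cause's probability is taken as the sum of the probabilities beneath it. The tree shape, the node names other than the two quoted in the text, and all probabilities are hypothetical.

```python
def locate(tree, leaf_prob, root, target_depth):
    """Descend from root, picking the most probable child at each level."""

    def probability(node):
        children = tree.get(node)
        if not children:
            return leaf_prob.get(node, 0.0)
        # A merged cause's probability is the sum over the causes merged into it.
        return sum(probability(child) for child in children)

    path = []
    node = root
    for _ in range(target_depth):
        children = tree.get(node)
        if not children:
            break
        node = max(children, key=probability)
        path.append((node, probability(node)))
    return path


# Hypothetical cause hierarchy; leaves are initial fault causes.
tree = {
    "device": ["monitoring device configuration", "monitoring device hardware"],
    "monitoring device configuration": ["monitoring device is plugged in", "address conflict"],
    "monitoring device hardware": ["board fault", "fan fault"],
}
leaf_prob = {
    "monitoring device is plugged in": 0.5,
    "address conflict": 0.25,
    "board fault": 0.125,
    "fan fault": 0.125,
}
path = locate(tree, leaf_prob, "device", target_depth=2)
```

Setting `target_depth` to 1 stops at the merged class, which corresponds to a coarser fault location target; a larger depth continues down to the initial cause.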
  • In this embodiment, the alarm information is extracted from the alarm design and description documents, the alarm fault causes are then combined and classified according to the fault location target, and the probability of occurrence of the fault causes of the alarm set (or alarm sequence) in the live network within the time window is calculated, thereby achieving hierarchical positioning of the fault cause and improving fault location efficiency. Further, by associating the alarm fault causes, the correlation of alarms is also improved. In addition, because the probability of occurrence of the alarm fault causes is updated in real time, the accuracy of alarm fault location is improved.
  • An embodiment of the present invention further provides a fault locating device, shown in FIG. 7.
  • The device includes: an extracting unit 71, a first establishing unit 72, a processing unit 73, an obtaining unit 74, a second establishing unit 75, a determining unit 76, a selecting unit 77, and a locating unit 78.
  • the extracting unit 71 is configured to extract the alarm information of each device in the network, and specifically, extract the alarm information from the alarm design and description documents of each device in the network.
  • the first establishing unit 72 is configured to establish a correspondence between the alarm name and the alarm fault cause in the alarm information, and specifically configured to establish a bipartite graph of the alarm set and the alarm fault cause set.
  • the definition of the bipartite graph is as described above, and will not be described here.
  • the processing unit 73 is configured to perform a combined classification on the alarm fault cause according to the fault location target, to obtain an alarm fault cause set.
  • The processing unit includes a setting unit and a classification unit, where the setting unit is configured to set a fault location target according to the device fault location principle, and the classification unit is configured to combine and classify the alarm fault causes according to the fault location target to obtain an alarm fault cause set.
  • the obtaining unit 74 is configured to acquire an alarm set of alarm data in the live network.
  • the second establishing unit 75 is configured to establish a correspondence between the alarm set and the alarm failure cause set, where the alarm set includes: a set of alarm names;
  • The determining unit 76 is configured to determine the probability of occurrence of the fault cause corresponding to the alarm set in a preset time window.
  • The determining unit includes a first probability calculating unit, a second probability calculating unit, and a probability determining unit. The first probability calculating unit is configured to calculate the probability of occurrence of each type of alarm in the alarm set in the preset time window, and is specifically configured to use the frequency of occurrence of the alarm in the preset time window as the probability of occurrence of the alarm in that time window. The second probability calculating unit is configured to calculate the probability of occurrence of the fault cause corresponding to each type of alarm. The probability determining unit is configured to determine, according to the probability of occurrence of each type of alarm and the probability of occurrence of each fault cause, the probability of occurrence of the fault cause corresponding to the alarm set in the preset time window.
  • The second probability calculating unit includes an initialization unit and a verification unit, where the initialization unit is configured to initialize the probability of occurrence of the fault cause of each type of alarm, and the verification unit is configured to calculate and verify, within the time window, the probability of occurrence of the fault cause of each type of alarm. It may further include an update unit, configured to update the probability of occurrence of the fault cause corresponding to each type of alarm.
  • The classification unit is further configured to obtain an alarm name set after combining and classifying the alarm fault causes according to the fault location target.
  • The device further includes a calculating unit, configured to calculate the degree of association of the alarm name set after the merged classification; the degree of association is used to verify the validity of the merged classification.
  • The selecting unit 77 is configured to select the fault causes whose probability of occurrence falls within the preset range as the fault causes of the alarm set.
  • The locating unit 78 is configured to hierarchically merge and classify the fault causes of the alarm set to implement hierarchical positioning of the fault cause.
  • The locating unit includes a hierarchical classification unit and a hierarchical positioning unit, where the hierarchical classification unit is configured to hierarchically merge and classify the fault causes of the alarm set to obtain the merge reason of each layer.
  • The hierarchical positioning unit is configured to calculate the fault cause step by step to complete the hierarchical positioning of the fault location target.
  • Optionally, the device may be integrated in a terminal or deployed independently; this embodiment imposes no limitation.
  • An embodiment of the present invention further provides a fault location device, shown in FIG. 8.
  • The device includes an alarm information extracting unit 81, an alarm information processing unit 82, an alarm data processing unit 83, and a fault hierarchy locating unit 84, where:
  • The alarm information extracting unit 81 is configured to extract alarm information of each device in the network, and establish a correspondence between the alarm names and the alarm fault causes in the alarm information.
  • The alarm data processing unit 83 is configured to: acquire an alarm set from the alarm data, and establish a correspondence between the alarm set and the alarm fault cause set, where the alarm set includes an alarm name set; determine the probability of occurrence of the fault causes corresponding to the alarm set within a preset time window; and select the fault cause whose probability of occurrence satisfies the preset range as the fault cause of the alarm set.
  • The fault hierarchy locating unit 84 is configured to perform hierarchical merging and classification on the fault causes of the alarm set, achieving hierarchical location of the fault cause.
  • The alarm information processing unit 82 is specifically configured to: set a fault location target according to a device fault location principle; and merge and classify the alarm fault causes according to the fault location target to obtain an alarm fault cause set.
  • Optionally, when the alarm information processing unit 82 merges and classifies the alarm fault causes according to the fault location target and obtains the merged and classified alarm name set, it is further configured to calculate the degree of association of that alarm name set; the degree of association is used to verify the validity of the merged classification.
  • That the alarm data processing unit 83 establishes a correspondence between the alarm set and the alarm fault cause set includes: establishing a bipartite graph of the alarm set and the alarm fault cause set according to the correspondence between the alarm names and the alarm fault causes.
  • That the alarm data processing unit 83 determines the probability of occurrence of the fault causes corresponding to the alarm set within the preset time window includes: calculating the probability of occurrence of each type of alarm in the alarm set within the time window; calculating the probability of occurrence of the fault cause corresponding to each type of alarm; determining, according to the probability of occurrence of each type of alarm and the probability of occurrence of each corresponding fault cause, the probability of occurrence of the fault causes corresponding to the alarm set within the time window; and selecting the fault cause whose probability of occurrence satisfies the preset range as the fault cause of the alarm set.
  • That the alarm data processing unit 83 calculates the probability of occurrence of each type of alarm in the alarm set within the predetermined time window includes: using the frequency of occurrence of an alarm within the predetermined time window as the probability of occurrence of that alarm within the predetermined time window.
  • That the alarm data processing unit 83 calculates the probability of occurrence of the fault cause corresponding to each type of alarm includes: initializing the probability of occurrence of the fault cause of each type of alarm; and calculating and verifying, within the predetermined time window, the probability of occurrence of the fault cause of each type of alarm.
  • That the alarm data processing unit 83 calculates the probability of occurrence of the fault cause corresponding to each type of alarm further includes: updating the probability of occurrence of the fault cause corresponding to each type of alarm.
  • That the alarm data processing unit 83 determines the fault cause of the alarm set within the time window includes: selecting the fault cause with the maximum probability of occurrence, or selecting the fault cause whose probability of occurrence satisfies a preset criterion, as the fault cause of the alarm set.
  • The fault hierarchy locating unit 84 is configured to perform hierarchical merging and classification on the fault causes of the alarm set to obtain the merged causes of each layer, and to calculate the fault cause level by level toward the upper layers, to complete the hierarchical location of the fault location target.
  • Optionally, the device may be integrated in a terminal or deployed independently; this embodiment imposes no limitation.
  • An embodiment of the present invention further provides a server.
  • The structure of the server is shown in FIG. 9.
  • The server 9 includes a memory 91, a transceiver 92, and a processor 93, where:
  • The memory 91 is configured to store the alarm design and description documents of each device in the network.
  • The transceiver 92 is configured to acquire the alarm design and description documents of each device from the memory 91, and extract alarm information from those documents.
  • The processor 93 is configured to establish a correspondence between the alarm names and the alarm fault causes in the alarm information, and to merge and classify the alarm fault causes according to the fault location target to obtain an alarm fault cause set.
  • The transceiver 92 is further configured to acquire an alarm set from the alarm data of the live network.
  • The processor 93 is further configured to: establish a correspondence between the alarm set and the alarm fault cause set, where the alarm set includes an alarm name set; determine the probability of occurrence of the fault causes corresponding to the alarm set within a preset time window; select the fault cause whose probability of occurrence satisfies the preset range as the fault cause of the alarm set; and perform hierarchical merging and classification on the fault causes of the alarm set, achieving hierarchical location of the fault cause.
  • That the processor merges and classifies the alarm fault causes according to the fault location target to obtain the alarm fault cause set includes: setting a fault location target according to a device fault location principle; and merging and classifying the alarm fault causes according to the fault location target to obtain the alarm fault cause set.
  • When the processor merges and classifies the alarm fault causes according to the fault location target and obtains the merged and classified alarm name set, the processor is further configured to calculate the degree of association of that alarm name set; the degree of association is used to verify the validity of the merged classification.
  • That the processor establishes the correspondence between the alarm set and the alarm fault cause set includes: establishing a bipartite graph of the alarm set and the alarm fault cause set according to the correspondence between the alarm names and the alarm fault causes.
  • That the processor determines the probability of occurrence of the fault causes corresponding to the alarm set within the preset time window includes: calculating the probability of occurrence of each type of alarm in the alarm set within the preset time window; calculating the probability of occurrence of the fault cause corresponding to each type of alarm; determining, according to the probability of occurrence of each type of alarm and the probability of occurrence of each corresponding fault cause, the probability of occurrence of the fault causes corresponding to the alarm set within the preset time window; and selecting the fault cause whose probability of occurrence satisfies the preset range as the fault cause of the alarm set.
  • That the processor calculates the probability of occurrence of each type of alarm in the alarm set within the predetermined time window includes: using the frequency of occurrence of an alarm within the predetermined time window as the probability of occurrence of that alarm within the predetermined time window.
  • That the processor calculates the probability of occurrence of the fault cause corresponding to each type of alarm includes: initializing the probability of occurrence of the fault cause of each type of alarm; and calculating and verifying, within the predetermined time window, the probability of occurrence of the fault cause of each type of alarm.
  • That the processor calculates the probability of occurrence of the fault cause corresponding to each type of alarm further includes: updating the probability of occurrence of the fault cause corresponding to each type of alarm.
  • That the processor determines the fault cause of the alarm set within the preset time window includes: selecting the fault cause with the maximum probability of occurrence, or selecting the fault cause whose probability of occurrence satisfies the preset range, as the fault cause of the alarm set.
  • That the processor performs hierarchical merging and classification on the fault causes of the alarm set to achieve hierarchical location of the fault cause includes: performing hierarchical merging and classification on the fault causes of the alarm set to obtain the merged causes of each layer; and calculating the fault cause level by level toward the upper layers, to complete the hierarchical location of the fault location target.
  • An embodiment of the present invention further provides a terminal that includes a server. The server, like the server described above, includes a memory and a processor; the functions of the memory and the processor are described in detail above and are not repeated here.
  • The UE may be any one of the following, and may be static or mobile.
  • The static UE may specifically be a terminal, a mobile station, or a user.
  • The mobile UE may specifically include a cellular phone, a personal digital assistant (PDA), a modem, a wireless communication device, a handheld device, a laptop computer, a cordless phone, a wireless local loop (WLL) station, and so on. Such UEs can be distributed throughout the wireless network.
  • PDA: personal digital assistant
  • Handheld: handheld device
  • WLL: wireless local loop
  • The present invention can be implemented by software plus a necessary general-purpose hardware platform, and can of course also be implemented in hardware, but in many cases the former is the better implementation.
  • The technical solution of the present invention, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product. The software product may be stored in a storage medium such as a ROM/RAM, a magnetic disk, or an optical disc, and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present invention or in certain parts of the embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

Disclosed in an embodiment of the present invention are a fault location method and device, the method comprising: extracting the alarm information of each device in a network, and establishing a correspondence between an alarm name and an alarm fault cause in the alarm information; according to a fault location target, merging and classifying the alarm fault causes to obtain a set of alarm fault causes; acquiring the alarm set of alarm data in a current network, and establishing a correspondence between the alarm set and the set of alarm fault causes, the alarm set comprising a set of alarm names; determining the probability of the fault cause occurrence corresponding to the alarm set in a preset time window, and selecting the probability of the fault cause occurrence satisfying a preset range as the fault cause of the alarm set; and hierarchically merging and classifying the fault causes of the alarm set to realize hierarchical locations of the fault causes. The embodiment of the present invention solves the technical problem in the prior art of inaccurate fault location and low location efficiency.

Description

Fault Location Method and Device
This application claims priority to Chinese Patent Application No. 201310467700.7, filed with the Chinese Patent Office on October 8, 2013 and entitled "Fault Location Method and Device", which is incorporated herein by reference in its entirety.
Technical Field
[01] The present invention relates to the field of network technologies, and in particular, to a fault location method and device.
Background
[02] While running, a network device (network element) generates different alarm information according to its own running state and changes in its environment. The alarm information includes an alarm reference document and alarm fault causes, and is sent to the network management system so that network management and operations and maintenance (O&M) personnel can use the alarm reference document in the alarm information to check the possible fault causes and thereby locate the cause of a network fault. However, the network management system collects a large amount of alarm information, which also contains many prompt alarms and derivative alarms, and this reduces the efficiency with which network management and O&M personnel can locate faults from the alarm information.
[03] In view of this, to improve fault location efficiency, one prior-art method analyzes the correlation between alarms (that is, associated alarm analysis) to remove derivative alarms; specifically, it applies particular logical relationships to the alarm field information to perform the associated alarm analysis and remove the derivative alarms. Another method uses a probabilistic diagnosis model to calculate the probability that a fault has occurred given the alarms: it uses the prior probabilities of the fault causes to judge whether an alarm will occur within a preset time window and sets a probability threshold; it then determines the fault cause according to whether the alarm occurs within the preset time window, and calculates the probability of the fault given the alarm.
[04] However, in researching and practicing the prior art, the inventors found that, in existing implementations, associated alarm analysis can remove derivative alarms but cannot locate the alarm fault. When a probabilistic diagnosis model is used, the same alarm may correspond to multiple fault causes, or the same fault may generate multiple alarms, so both the accuracy and the efficiency of fault location are relatively low.
Summary of the Invention
[05] Embodiments of the present invention provide a fault location method and device, to solve the prior-art technical problems of inaccurate fault location and low location efficiency.
[06] To solve the foregoing technical problems, the embodiments of the present invention disclose the following technical solutions:
[07] A first aspect provides a fault location method, the method including:
[08] extracting alarm information of each device in a network, and establishing a correspondence between the alarm names and the alarm fault causes in the alarm information;
[09] merging and classifying the alarm fault causes according to a fault location target to obtain an alarm fault cause set;
[10] acquiring an alarm set from the alarm data of the live network, and establishing a correspondence between the alarm set and the alarm fault cause set, where the alarm set includes an alarm name set;
[11] determining the probability of occurrence of the fault causes corresponding to the alarm set within a predetermined time window;
[12] selecting the fault cause whose probability of occurrence satisfies a preset range as the fault cause of the alarm set; and
[13] performing hierarchical merging and classification on the fault causes of the alarm set to achieve hierarchical location of the fault cause.
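The selection in step [12] can be illustrated with a minimal sketch. It assumes the per-cause probabilities are already available as a dictionary and reads the "preset range" as a closed interval; the function names, the interval bounds, and the cause names are hypothetical and do not come from the patent. The maximum-probability rule shown second is the alternative the server and device embodiments mention.

```python
def select_fault_causes(cause_probs, low=0.5, high=1.0):
    """Return (cause, probability) pairs whose probability lies in [low, high],
    sorted from most to least likely. The interval is an assumed reading of
    the 'preset range'."""
    selected = [(c, p) for c, p in cause_probs.items() if low <= p <= high]
    return sorted(selected, key=lambda cp: cp[1], reverse=True)


def most_likely_cause(cause_probs):
    """Alternative rule: pick the single cause with the maximum probability."""
    return max(cause_probs, key=cause_probs.get)
```

For example, with the hypothetical input `{"link_down": 0.7, "board_fault": 0.2, "power_loss": 0.55}`, the range rule keeps `link_down` and `power_loss`, while the maximum rule keeps only `link_down`.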
[14] In a first possible implementation of the first aspect, the merging and classifying of the alarm fault causes according to the fault location target to obtain the alarm fault cause set includes:
[15] setting a fault location target according to a device fault location principle; and
[16] merging and classifying the alarm fault causes according to the fault location target to obtain the alarm fault cause set.
[17] With reference to the first aspect or the first possible implementation of the first aspect, in a second possible implementation, the merging and classifying of the alarm fault causes according to the fault location target further yields an alarm name set, and the method further includes:
[18] calculating the degree of association of the alarm name set, where the degree of association is used to verify the validity of the merged classification.
[19] With reference to the first aspect or the first or second possible implementation of the first aspect, in a third possible implementation, the establishing of the correspondence between the alarm set and the alarm fault cause set includes:
[20] establishing a bipartite graph of the alarm set and the alarm fault cause set according to the correspondence between the alarm names and the alarm fault causes.
[21] With reference to the first aspect or the first, second, or third possible implementation of the first aspect, in a fourth possible implementation, the determining of the probability of occurrence of the fault causes corresponding to the alarm set within the predetermined time window includes:
[22] calculating the probability of occurrence of each type of alarm in the alarm set within the predetermined time window;
[23] calculating the probability of occurrence of the fault cause corresponding to each type of alarm; and
[24] determining, according to the probability of occurrence of each type of alarm and the probability of occurrence of each corresponding fault cause, the probability of occurrence of the fault causes corresponding to the alarm set within the predetermined time window.
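Steps [22] to [24] do not state how the two probabilities are combined. One plausible reading, offered here only as an assumption, is the law of total probability: the probability of a cause is the sum, over the alarms in the set, of P(alarm) times P(cause | alarm). The function name and the data layout below are hypothetical.

```python
def cause_probabilities(alarm_probs, cause_given_alarm):
    """Combine per-alarm probabilities with per-alarm cause probabilities.

    alarm_probs:       {alarm_name: P(alarm) within the time window}
    cause_given_alarm: {alarm_name: {cause: P(cause | alarm)}}
    Returns {cause: sum over alarms of P(alarm) * P(cause | alarm)}.
    This weighting is an assumption, not a formula quoted from the patent.
    """
    totals = {}
    for alarm, p_alarm in alarm_probs.items():
        for cause, p_cause in cause_given_alarm.get(alarm, {}).items():
            totals[cause] = totals.get(cause, 0.0) + p_alarm * p_cause
    return totals
```

Under this reading, a cause that explains several frequent alarms accumulates weight from each of them, which matches the many-to-many relationship between alarms and causes described in the background.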
[25] With reference to the first aspect or the first, second, third, or fourth possible implementation of the first aspect, in a fifth possible implementation, the calculating of the probability of occurrence of each type of alarm in the alarm set within the predetermined time window includes:
[26] using the frequency of occurrence of an alarm within the predetermined time window as the probability of occurrence of that alarm within the preset time window.
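A minimal sketch of the frequency estimate in [26] follows. It assumes alarms arrive as (timestamp, alarm name) pairs and that the "frequency" is normalized by the total number of alarms in the window so that the values sum to one; the normalization, the half-open window, and the function name are assumptions for illustration.

```python
from collections import Counter


def alarm_probabilities(alarms, window_start, window_end):
    """Estimate P(alarm) as the relative frequency of each alarm name
    among all alarms whose timestamp falls in [window_start, window_end).

    alarms: iterable of (timestamp, alarm_name) pairs.
    """
    counts = Counter(name for ts, name in alarms
                     if window_start <= ts < window_end)
    total = sum(counts.values())
    if total == 0:
        return {}  # no alarms in the window, so nothing to estimate
    return {name: n / total for name, n in counts.items()}
```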
[27] With reference to the first aspect or the first, second, third, fourth, or fifth possible implementation of the first aspect, in a sixth possible implementation, the calculating of the probability of occurrence of the fault cause corresponding to each type of alarm includes:
[28] initializing the probability of occurrence of the fault cause of each type of alarm;
[29] calculating and verifying, within the predetermined time window, the probability of occurrence of the fault cause of each type of alarm; and
[30] updating the probability of occurrence of the fault cause corresponding to each type of alarm.
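The initialize, verify, and update cycle in [28] to [30] could be sketched as follows. The passage does not give the initial values or the update rule, so this sketch assumes a uniform initialization over each alarm's candidate causes and a smoothed update toward the cause frequencies confirmed in one window; the function names, the `confirmed` input, and the `rate` parameter are all hypothetical.

```python
def init_cause_probs(alarm_to_causes):
    """Initialise P(cause | alarm) uniformly over each alarm's candidate causes."""
    return {alarm: {c: 1.0 / len(causes) for c in causes}
            for alarm, causes in alarm_to_causes.items() if causes}


def update_cause_probs(probs, confirmed, rate=0.5):
    """Blend the current estimates with frequencies observed in one window.

    confirmed: {alarm: {cause: number of times the cause was confirmed}}
    rate:      assumed smoothing weight given to the new observations.
    """
    updated = {}
    for alarm, dist in probs.items():
        counts = confirmed.get(alarm, {})
        total = sum(counts.get(c, 0) for c in dist)
        if total == 0:
            updated[alarm] = dict(dist)  # nothing observed: keep the estimate
            continue
        updated[alarm] = {c: (1 - rate) * p + rate * counts.get(c, 0) / total
                          for c, p in dist.items()}
    return updated
```

Because both the old estimate and the observed frequencies sum to one for each alarm, the blended distribution also sums to one, so it can be fed back into the next window unchanged.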
[31] With reference to the first aspect or the first, second, third, fourth, fifth, or sixth possible implementation of the first aspect, in a seventh possible implementation, the performing of hierarchical merging and classification on the fault causes of the alarm set to achieve hierarchical location of the fault cause includes:
[32] performing hierarchical merging and classification on the fault causes of the alarm set to obtain the merged causes of each layer; and
[33] calculating the fault cause level by level toward the upper layers, to complete the hierarchical location of the fault location target.
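The level-by-level calculation in [32] and [33] can be pictured as rolling probabilities up a cause hierarchy. The sketch below assumes the hierarchy is a tree given as a child-to-parent map and that a merged cause's probability is the sum of its children's probabilities; both the summation rule and the names are assumptions, since the passage does not define them.

```python
def roll_up(leaf_probs, parent_of):
    """Aggregate cause probabilities level by level towards the root.

    leaf_probs: {cause: probability} at the most detailed level.
    parent_of:  {node: parent node}; nodes missing from the map are roots.
                The map is assumed to be acyclic (a tree or forest).
    Returns a list of {node: probability} dicts, one per level, leaf first.
    """
    levels = [dict(leaf_probs)]
    current = leaf_probs
    while any(node in parent_of for node in current):
        merged = {}
        for node, p in current.items():
            parent = parent_of.get(node, node)  # roots carry over unchanged
            merged[parent] = merged.get(parent, 0.0) + p
        levels.append(merged)
        current = merged
    return levels
```

Reading the returned list from the last level back to the first gives a coarse-to-fine view: the top level names the broad category most likely at fault, and lower levels narrow it to a specific cause.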
[34] A second aspect provides a fault location device, including:
[35] an extracting unit, configured to extract alarm information of each device in a network;
[36] a first establishing unit, configured to establish a correspondence between the alarm names and the alarm fault causes in the alarm information;
[37] a processing unit, configured to merge and classify the alarm fault causes according to a fault location target to obtain an alarm fault cause set;
[38] an acquiring unit, configured to acquire an alarm set from the alarm data of the live network;
[39] a second establishing unit, configured to establish a correspondence between the alarm set and the alarm fault cause set, where the alarm set includes an alarm name set;
[40] a determining unit, configured to determine the probability of occurrence of the fault causes corresponding to the alarm set within a predetermined time window;
[41] a selecting unit, configured to select the fault cause whose probability of occurrence satisfies a preset range as the fault cause of the alarm set; and
[42] a locating unit, configured to perform hierarchical merging and classification on the fault causes of the alarm set to achieve hierarchical location of the fault cause.
[43] In a first possible implementation of the second aspect, the processing unit includes:
[44] a setting unit, configured to set a fault location target according to a device fault location principle; and
[45] a classification unit, configured to merge and classify the alarm fault causes according to the fault location target to obtain the alarm fault cause set.
[46] With reference to the second aspect or the first possible implementation of the second aspect, in a second possible implementation,
[47] the classification unit is further configured to obtain an alarm name set after merging and classifying the alarm fault causes according to the fault location target; and the device further includes:
[48] a calculating unit, configured to calculate the degree of association of the alarm name set after the merging and classification, where the degree of association is used to verify the validity of the merged classification.
[49] With reference to the second aspect or the first or second possible implementation of the second aspect, in a third possible implementation, the first establishing unit is specifically configured to establish a bipartite graph of the alarm set and the alarm fault cause set according to the correspondence between the alarm names and the alarm fault causes.
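The bipartite graph of the alarm set and the alarm fault cause set can be sketched as two adjacency maps, one per side of the graph. The sketch assumes the alarm-name-to-cause correspondence is available as (alarm, cause) pairs and keeps only edges for alarms actually present in the alarm set; the function name, the data layout, and the alarm and cause names in the usage note are hypothetical.

```python
from collections import defaultdict


def build_bipartite(alarm_cause_pairs, observed_alarms):
    """Build the two adjacency maps of an alarm/cause bipartite graph.

    alarm_cause_pairs: iterable of (alarm_name, cause) edges taken from the
                       alarm-name-to-cause correspondence.
    observed_alarms:   alarm names present in the alarm set; edges for other
                       alarms are ignored.
    """
    observed = set(observed_alarms)
    alarm_to_causes = defaultdict(set)
    cause_to_alarms = defaultdict(set)
    for alarm, cause in alarm_cause_pairs:
        if alarm in observed:
            alarm_to_causes[alarm].add(cause)
            cause_to_alarms[cause].add(alarm)
    return dict(alarm_to_causes), dict(cause_to_alarms)
```

Keeping both directions makes the many-to-many relationship explicit: `alarm_to_causes["LOS"]` lists every candidate cause of one alarm, and `cause_to_alarms["fiber_cut"]` lists every alarm that one cause can raise.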
[50] With reference to the second aspect or the first, second, or third possible implementation of the second aspect, in a fourth possible implementation, the determining unit includes:
[51] a first probability calculation unit, configured to calculate the probability of occurrence of each type of alarm in the alarm set within the predetermined time window;
[52] a second probability calculation unit, configured to calculate the probability of occurrence of the fault cause corresponding to each type of alarm; and
[53] a probability determining unit, configured to determine, according to the probability of occurrence of each type of alarm and the probability of occurrence of each corresponding fault cause, the probability of occurrence of the fault causes corresponding to the alarm set within the predetermined time window.
[54] With reference to the second aspect or the first, second, third, or fourth possible implementation of the second aspect, in a fifth possible implementation,
[55] the first probability calculation unit is specifically configured to use the frequency of occurrence of an alarm within the predetermined time window as the probability of occurrence of that alarm within the preset time window.
[56] With reference to the second aspect or the first, second, third, fourth, or fifth possible implementation of the second aspect, in a sixth possible implementation, the second probability calculation unit includes:
[57] an initialization unit, configured to initialize the probability of occurrence of the fault cause of each type of alarm;
[58] a verification unit, configured to calculate and verify, within the predetermined time window, the probability of occurrence of the fault cause of each type of alarm; and
[59] an update unit, configured to update the probability of occurrence of the fault cause corresponding to each type of alarm.
[60] With reference to the second aspect or the first, second, third, fourth, fifth, or sixth possible implementation of the second aspect, in a seventh possible implementation, the locating unit includes:
[61] a hierarchical classification unit, configured to perform hierarchical merging and classification on the fault causes of the alarm set to obtain the merged causes of each layer; and
[62] a hierarchical locating unit, configured to calculate the fault cause level by level toward the upper layers, to complete the hierarchical location of the fault location target.
[63] A third aspect provides a fault location device, including:
[64] an alarm information extracting unit, configured to extract alarm information of each device in a network, and establish a correspondence between the alarm names and the alarm fault causes in the alarm information;
[65] an alarm information processing unit, configured to merge and classify the alarm fault causes according to a fault location target to obtain an alarm fault cause set;
[66] an alarm data processing unit, configured to: acquire an alarm set from the alarm data, and establish a correspondence between the alarm set and the alarm fault cause set, where the alarm set includes an alarm name set; determine the probability of occurrence of the fault causes corresponding to the alarm set within a predetermined time window; and select the fault cause whose probability of occurrence satisfies a preset range as the fault cause of the alarm set; and
[67] a fault hierarchy locating unit, configured to perform hierarchical merging and classification on the fault causes of the alarm set to achieve hierarchical location of the fault cause.
[68] In a first possible implementation of the third aspect, the alarm information processing unit is specifically configured to: set a fault location target according to a device fault location principle; and merge and classify the alarm fault causes according to the fault location target to obtain the alarm fault cause set.
[69] With reference to the third aspect or the first possible implementation of the third aspect, in a second possible implementation, when the alarm information processing unit merges and classifies the alarm fault causes according to the fault location target and obtains the merged and classified alarm name set, it is further configured to calculate the degree of association of that alarm name set, where the degree of association is used to verify the validity of the merged classification.
[70] With reference to the third aspect or the first or second possible implementation of the third aspect, in a third possible implementation, that the alarm data processing unit establishes the correspondence between the alarm set and the alarm fault cause set includes: establishing a bipartite graph of the alarm set and the alarm fault cause set according to the correspondence between the alarm names and the alarm fault causes.
[71] With reference to the third aspect or the first, second, or third possible implementation of the third aspect, in a fourth possible implementation, that the alarm data processing unit determines the probability of occurrence of the fault causes corresponding to the alarm set within the preset time window includes: calculating the probability of occurrence of each type of alarm in the alarm set within the preset time window; calculating the probability of occurrence of the fault cause corresponding to each type of alarm; determining, according to the probability of occurrence of each type of alarm and the probability of occurrence of each corresponding fault cause, the probability of occurrence of the fault causes corresponding to the alarm set within the preset time window; and selecting the fault cause whose probability of occurrence satisfies the preset range as the fault cause of the alarm set.
[72] With reference to the third aspect or the first, second, third, or fourth possible implementation of the third aspect, in a fifth possible implementation, that the alarm data processing unit calculates the probability of occurrence of each type of alarm in the alarm set within the preset time window includes: using the frequency of occurrence of an alarm within the predetermined time window as the probability of occurrence of that alarm within the predetermined time window.
[73] With reference to the third aspect or the first, second, third, fourth, or fifth possible implementation of the third aspect, in a sixth possible implementation, that the alarm data processing unit calculates the probability of occurrence of the fault cause corresponding to each type of alarm includes: initializing the probability of occurrence of the fault cause of each type of alarm; calculating and verifying, within the predetermined time window, the probability of occurrence of the fault cause of each type of alarm; and updating the probability of occurrence of the fault cause corresponding to each type of alarm.
[74] With reference to the third aspect or the first, second, third, fourth, fifth, or sixth possible implementation of the third aspect, in a seventh possible implementation, the fault hierarchy locating unit is specifically configured to: perform hierarchical merging and classification on the fault causes of the alarm set to obtain the merged causes of each layer; and calculate the fault cause level by level toward the upper layers, to complete the hierarchical location of the fault location target.
[75] As can be seen from the foregoing technical solutions, in the embodiments of the present invention, alarm information is first extracted from the alarm design and description documents. The alarm fault causes are then merged and classified according to the fault location target, and the probability of occurrence of the fault causes of the alarm set (or alarm sequence) in the live network within a time window is calculated. This implements hierarchical locating of the fault cause and improves fault location efficiency. Further, merging and classifying the alarm fault causes improves the correlation of the alarms, and updating the fault cause probabilities of the alarms in real time improves the accuracy of locating the fault causes of the alarms.

Brief Description of the Drawings

[76] To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings required by the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
[77] FIG. 1 is a flowchart of a fault location method according to an embodiment of the present invention;
[78] FIG. 2 is a schematic diagram of a bipartite graph of an alarm set and a fault cause set according to an embodiment of the present invention;
[79] FIG. 3 is a schematic diagram of using the frequency of occurrence of an alarm to approximate the probability of occurrence of the alarm according to this embodiment;
[80] FIG. 4 is a schematic diagram of updating the probability of occurrence of a fault cause according to an embodiment of the present invention;
[81] FIG. 5 is a schematic diagram of hierarchical fault locating according to an embodiment of the present invention;
[82] FIG. 6 is a schematic diagram of merging and classifying fault causes according to an embodiment of the present invention;
[83] FIG. 7 is a schematic structural diagram of a fault location apparatus according to an embodiment of the present invention;
[84] FIG. 8 is another schematic structural diagram of a fault location apparatus according to an embodiment of the present invention;
[85] FIG. 9 is a schematic structural diagram of a server according to an embodiment of the present invention.

Detailed Description
[86] The following embodiments of the present invention provide a fault location method and apparatus. In the embodiments, fault causes are merged and classified, and the probability of occurrence of the merged and classified fault causes is calculated, which effectively improves the efficiency and accuracy of fault location.
[87] The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the present invention.
[88] Refer to FIG. 1, which is a flowchart of a fault location method according to an embodiment of the present invention. In this embodiment, it is assumed that the target network for fault location includes alarm design or description documents, and that an alarm includes an alarm name and the fault causes of the alarm. The method includes:
[89] Step 101: Obtain the alarm information of each device in the network, and establish the correspondence between the alarm names and the alarm fault causes in the alarm information.
[90] In this embodiment, the server can extract the alarm information from the alarm design and description documents (or reference documents) of each device in the network. The alarm information includes the alarm name and the alarm fault causes, and may also include the alarm ID and other parameters, which is not limited in this embodiment.
[91] In this embodiment, the alarm information of a device normally comes with alarm descriptions and reference documents that explain the alarm. The alarm information includes the alarm name and the alarm fault causes (the fault causes that produce the alarm), and may also include an ID (number) and an alarm level (indicating the severity of the alarm). During fault location, the alarm name, the alarm ID, and the fault causes that produce the alarm are extracted first, and the correspondence between alarm names and alarm fault causes is established. The correspondence is shown in Table 1, which is only an example and is not limiting:
[92] Table 1
[Table 1 appears as an image in the original publication (imgf000011_0001).]
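The correspondence between alarm names and fault causes built in step 101 can be pictured as a simple lookup table. The following is a minimal sketch only: the record layout, the alarm names, and the cause strings are hypothetical and are not taken from Table 1, whose contents are not reproduced in this text.

```python
from collections import defaultdict


def build_name_to_causes(records):
    """Map each alarm name to the set of fault causes documented for it."""
    mapping = defaultdict(set)
    for rec in records:
        mapping[rec["name"]].update(rec["causes"])
    return dict(mapping)


# Hypothetical entries extracted from alarm description documents.
records = [
    {"id": 1, "name": "LINK_DOWN", "causes": ["cable unplugged", "port failure"]},
    {"id": 2, "name": "BOARD_FAULT", "causes": ["board hardware failure"]},
    {"id": 3, "name": "LINK_DOWN", "causes": ["peer device down"]},
]
name_to_causes = build_name_to_causes(records)
```

Keeping the causes as sets makes the later set operations (union, intersection) direct.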
[93] Step 102: Merge and classify the alarm fault causes according to the fault location target to obtain an alarm fault cause set, where the alarm fault cause set is the set obtained by merging and classifying the alarm fault causes listed in Table 1.
[94] In this step, the server normally first sets the fault location target according to the device fault location principle. Because alarm information is generally reported to the network management platform, the alarm information covers the fault information of the entire network. The fault location target is the level at which the fault needs to be located. For a network system, the fault location target can be a device, a module of a device, or a software, hardware, or configuration problem of a device module, and so on. The design of the fault location target can be tied to the composition of the network and the module design of the devices.
[95] The alarm fault causes are then merged and classified according to the fault location target to obtain the alarm fault cause set. For example, the causes can be divided by device module: if a device contains three modules and the fault location target is the module, all fault causes within each module can be merged and classified together.
[96] In this embodiment, natural language processing (NLP) is used as an example method for merging and classifying the fault causes. Part of the resulting merge is shown in Table 2:
[97] Table 2
[Table 2 appears as an image in the original publication (imgf000012_0001).]
[98] As can be seen from Table 2, in this embodiment the alarm fault causes (that is, the initial fault causes) can be merged, according to the fault location target, into "merged cause (1)" or into "merged cause (2)". Clearly, "merged cause (2)" is coarser-grained than "merged cause (1)".
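The two merge levels can be illustrated with a small lookup from each initial cause to its merged cause at each level. This is a sketch under assumptions: the cause strings below are hypothetical (Table 2 is not reproduced in this text), and a fixed dictionary stands in for whatever NLP-based grouping an implementation would actually use.

```python
# Hypothetical hierarchy: initial cause -> (merged cause (1), merged cause (2)).
MERGE_LEVELS = {
    "monitoring device cable not firmly plugged in": ("monitoring device cable", "monitoring device"),
    "monitoring device cable damaged": ("monitoring device cable", "monitoring device"),
    "monitoring device misconfigured": ("monitoring device configuration", "monitoring device"),
}


def merge_cause(initial_cause, level):
    """Return the cause at the requested level (0 = initial, 1 or 2 = merged)."""
    if level == 0:
        return initial_cause
    return MERGE_LEVELS[initial_cause][level - 1]


def merge_cause_set(causes, level):
    """Merge a collection of initial causes into the set of causes at one level."""
    return {merge_cause(cause, level) for cause in causes}


level1 = merge_cause_set(MERGE_LEVELS, 1)  # two finer-grained merged causes
level2 = merge_cause_set(MERGE_LEVELS, 2)  # one coarser-grained merged cause
```

The coarser the level, the fewer distinct causes remain, which matches the granularity relation between merged cause (1) and merged cause (2).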
[99] Optionally, in another embodiment, merging and classifying the alarm fault causes according to the fault location target also yields an alarm name set, and the association degree of the merged and classified alarm name set (also called the alarm set) is calculated, where the association degree is used to verify the validity of the merging and classification. The alarm name set is the set obtained by merging and classifying the alarm names in Table 1. After the alarm fault causes are merged and classified, a corresponding alarm ID set can also be obtained, which is the set obtained by classifying and merging the IDs in Table 1.
[100] That is, for an alarm set $I = \{A_1, A_2, \ldots, A_n\}$, if the association degree of the alarm set before merging and classification is $I_0$ and the association degree after merging and classification is $I_1$, the merge is a valid merge when $I_1 \ge I_0$.
[101] The association degree of the merged and classified alarm name set (alarm set) is calculated as follows.
[102] First, the alarm similarity and the association degree of an alarm set are defined.
[103] The alarm similarity is the degree to which two alarms are similar. For any two alarms A and B whose alarm fault cause sets are U(A) and U(B) respectively, the similarity of alarms A and B is defined as:

$$S_{A,B} = \frac{|U(A \cap B)|}{|U(A \cup B)|}$$
[104] In this formula, $U(A \cup B)$ is the union of the alarm fault cause sets U(A) and U(B), $U(A \cap B)$ is the intersection of the alarm fault cause sets U(A) and U(B), and $S_{A,B}$ is the similarity of alarms A and B.
[105] The association degree of an alarm set is the degree to which similar alarms exist in the alarm set. Let the alarm set be $I = \{A_1, A_2, \ldots, A_n\}$. The association degree of alarm set I is defined as:

[The formula for the association degree appears as an image in the original publication (imgf000013_0001).]
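The similarity of two alarms and the validity check for a merge can be sketched as follows. The similarity follows the ratio of intersection to union of the two cause sets given above. The association degree formula is not reproduced in this text, so the mean pairwise similarity used here is an assumption chosen only to make the check concrete; all cause strings are hypothetical.

```python
from itertools import combinations


def similarity(causes_a, causes_b):
    """Size of the intersection of two cause sets divided by the size of their union."""
    union = causes_a | causes_b
    if not union:
        return 0.0
    return len(causes_a & causes_b) / len(union)


def association_degree(cause_sets):
    """ASSUMPTION: mean pairwise similarity; the source formula is not available."""
    pairs = list(combinations(cause_sets, 2))
    if not pairs:
        return 0.0
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


def merge_is_valid(before, after):
    """A merge is valid when the association degree does not decrease."""
    return association_degree(after) >= association_degree(before)


# Hypothetical cause sets of three alarms, before and after merging.
before = [{"cable loose"}, {"cable damaged"}, {"board failure"}]
after = [{"cable"}, {"cable"}, {"board"}]
valid = merge_is_valid(before, after)
```

In this example no two alarms share a cause before merging, while two of them share the merged cause "cable" afterwards, so the association degree rises and the merge passes the check.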
[106] This embodiment is not limited to the foregoing way of calculating the association degree of an alarm set; other similar methods may also be used.
[107] In this embodiment, fault location targets often differ. For example, for an alarm caused by a board fault, the goal may be to locate the device where the fault occurred, the board where the fault occurred, or a hardware or software fault of the board, and so on. The alarm fault causes can be merged according to the fault location target. The merging can use the foregoing method or another method, which is not limited in this embodiment.
[108] Step 103: Obtain the alarm set of the alarm data in the live network, and establish the correspondence between the alarm set and the alarm fault cause set, where the alarm set includes a set of alarm names.
[109] The alarm data refers to the alarms obtained from the live network. An alarm includes the alarm name, the time at which the alarm occurred, the alarm frequency, and so on. The alarm set includes the alarm names, the times at which the alarms occurred, the alarm frequencies, and so on.
[110] In this step, the server may establish the correspondence between the alarm set and the alarm fault cause set by building a bipartite graph of the alarm set and the alarm fault cause set according to the correspondence between alarm names and alarm fault causes. A bipartite graph is a special model in graph theory: for an undirected graph G = (V, E), if the vertex set V can be partitioned into two disjoint sets A and B such that the two endpoints i and j of every edge <i, j> in E belong to A and B respectively, G is called a bipartite graph.
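The bipartite graph described above can be represented with two node lists and a set of edges that only ever join an alarm to a fault cause. This is a minimal sketch; the function name and the sample alarm and cause names are hypothetical.

```python
def build_bipartite_graph(observed_alarms, name_to_causes):
    """Return (alarm_nodes, cause_nodes, edges) for the alarms seen in the live network.

    Every edge joins one alarm node to one fault cause node, so the two node
    sets form the two sides of a bipartite graph.
    """
    alarm_nodes = sorted(set(observed_alarms))
    edges = {(alarm, cause)
             for alarm in alarm_nodes
             for cause in name_to_causes.get(alarm, ())}
    cause_nodes = sorted({cause for _, cause in edges})
    return alarm_nodes, cause_nodes, edges


# Hypothetical correspondence and observed alarm names.
name_to_causes = {
    "A1": {"f1", "f2"},
    "A2": {"f2"},
    "A3": {"f3"},
}
alarm_nodes, cause_nodes, edges = build_bipartite_graph(["A1", "A2", "A1"], name_to_causes)
```

Only alarms that actually occur contribute nodes and edges, so causes of alarms that were not observed (f3 here) stay out of the graph.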
[111] Let the alarm set obtained from the live network be $I = \{A_1, A_2, \ldots, A_n\}$ and the corresponding fault cause set be $F = \{f_1, f_2, \ldots, f_m\}$. A bipartite graph of the alarm set and the fault cause set is built, as shown in FIG. 2.
[112] Step 104: Determine the probability of occurrence of the fault causes corresponding to the alarm set within a preset time window.
[113] In this step, the probability of each fault cause given a known alarm set (also called an alarm sequence) is calculated. Methods based on Bayesian networks, Markov chains, and the like can be used to calculate the probability of occurrence of each fault cause. The procedure is as follows: set a time window T; let the alarm set within the time window be $I = \{A_1, A_2, \ldots, A_n\}$; calculate the probability of occurrence of each alarm within the time window and the probability of occurrence of each fault cause corresponding to each alarm; then calculate the probability of occurrence of the fault causes given the alarm set. The following uses the Bayesian-network-based method as an example to describe how the probability of occurrence of the fault causes of the alarm set is calculated.
[114] First, calculate the probability of occurrence of each type of alarm in the alarm set within the time window.
[115] In the alarm set (or alarm sequence) $I = \{A_1, A_2, \ldots, A_n\}$ within the time window, let the numbers of the individual alarms be $n_1, n_2, \ldots, n_n$ and the total number of alarms be N. The frequency of occurrence of an alarm is used to approximate its probability of occurrence, that is:
[116]
$$P_{alarm_i} = \frac{n_i}{N}$$
[117] where $P_{alarm_i}$ is the probability of occurrence of the i-th alarm and N is the total number of alarms. FIG. 3 is a schematic diagram of using the frequency of occurrence of an alarm to approximate the probability of occurrence of the alarm in this embodiment.
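The frequency approximation above amounts to counting how often each alarm name appears in the time window and dividing by the total. A minimal sketch, with hypothetical alarm names and exact fractions used to avoid rounding noise:

```python
from collections import Counter
from fractions import Fraction


def alarm_probabilities(alarms_in_window):
    """Approximate P(alarm_i) by n_i / N over the alarms seen in one time window."""
    counts = Counter(alarms_in_window)
    total = sum(counts.values())
    return {name: Fraction(n, total) for name, n in counts.items()}


# Hypothetical alarm sequence observed within one time window.
window = ["A1", "A1", "A2", "A3"]
p_alarm = alarm_probabilities(window)
```

Because every alarm in the window is counted exactly once, the resulting values always sum to 1.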
[118] Second, calculate the probability of occurrence of the fault causes corresponding to each type of alarm. This specifically includes: initializing the probability of occurrence of the fault causes of each type of alarm; and calculating and verifying, within the preset time window, the probability of occurrence of the fault causes of each type of alarm.
[119] That is, for any alarm that has c fault causes, the probability $P_{alarm_i, fault_j}$ of each fault cause of the alarm is initialized as:

$$P_{alarm_i, fault_j} = \frac{1}{c} \quad (0 < i \le n,\ 0 < j \le c)$$

Further, the method may also include updating the probability of occurrence of the fault causes corresponding to each type of alarm. A schematic diagram of updating the probability of occurrence of a fault cause is shown in FIG. 4.
[120] As shown in FIG. 4, the cause of an alarm is analyzed from the alarm data of the live network, and the fault cause that produced the alarm is determined. After the fault cause is verified, the result is fed back into the next calculation of the fault cause of that alarm. When it is determined within a time window that alarm $alarm_i$ was produced by fault cause $fault_j$, the alarm fault cause probabilities are updated as follows:

[The update formulas appear as an image in the original publication (imgf000015_0001).]

[121] where $P_{alarm_i}$ is the probability of occurrence of the fault causes of each type of alarm, $P_{alarm_i, fault_j}$ is the probability that fault cause j produces alarm i, $P_{alarm_i, other}$ is the probability that a fault cause other than fault cause j produces alarm i, and C is a constant.
[122] For example, suppose an alarm has three fault causes $fault_1$, $fault_2$, and $fault_3$. The probability of each fault cause is initialized as $P_{fault_1} = P_{fault_2} = P_{fault_3} = \frac{1}{3}$. When the fault cause is determined within a time window to be $fault_1$, then:

$$P_{fault_1} = \frac{1}{3} + \left(1 - \frac{1}{3}\right) \cdot \frac{1}{3} = \frac{5}{9}$$

$$P_{fault_2} = \frac{1}{3} - \frac{1}{3} \cdot \frac{1}{3} = \frac{2}{9}$$

$$P_{fault_3} = \frac{1}{3} - \frac{1}{3} \cdot \frac{1}{3} = \frac{2}{9}$$
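The initialization and update can be sketched in a few lines. The update formulas themselves are not reproduced in this text, so the rule below is an assumption: the confirmed cause is raised by (1 - p)·C and every other cause is lowered by p·C. With three causes and C = 1/3 it gives the values 5/9, 2/9, and 2/9, but other rules could produce the same numbers for this single case.

```python
from fractions import Fraction


def init_cause_probabilities(causes):
    """Give each of the c fault causes of an alarm the same initial probability 1/c."""
    c = len(causes)
    return {cause: Fraction(1, c) for cause in causes}


def update_cause_probabilities(probs, confirmed, C):
    """ASSUMED update rule: raise the confirmed cause, lower all other causes."""
    updated = {}
    for cause, p in probs.items():
        if cause == confirmed:
            updated[cause] = p + (1 - p) * C
        else:
            updated[cause] = p - p * C
    return updated


probs = init_cause_probabilities(["fault1", "fault2", "fault3"])
probs = update_cause_probabilities(probs, "fault1", Fraction(1, 3))
```

Under this assumed rule the probabilities of an alarm's causes still sum to 1 after each update, so repeated confirmations keep shifting weight toward the causes that are actually observed.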
[123] Finally, the probability of occurrence of the fault causes corresponding to the alarm set within the time window is determined according to the probability of occurrence of each type of alarm and the probability of occurrence of each corresponding fault cause.
[124] That is, this example uses the Bayesian network method to calculate the probability of each fault cause of the alarm sequence within the time window, which specifically includes the following.
[125] Let the alarm sequence within the time window be $I = \{A_1, A_2, \ldots, A_n\}$, containing n different types of alarms. The frequency of each type of alarm is $n_i$ and the total number of alarms is N, that is, $\sum_{i=1}^{n} n_i = N$. The alarm types have $c_1, \ldots, c_n$ fault causes respectively.

[The expression for the total set of fault causes appears as an image in the original publication (imgf000016_0001).]

Then, for a fault cause $fault_j$, the probability of its occurrence is:
[126]
$$P(fault_j \mid \langle alarm_1, alarm_2, \ldots, alarm_n \rangle) = \sum_{fault_j \in alarm_i} P_{alarm_i} \, P_{alarm_i, fault_j}$$
[127] where $P_{alarm_i}$ denotes the probability that $alarm_i$ occurs within the time window, calculated in the same way as the probability of occurrence of each type of alarm in the alarm set within the time window, that is, by using the frequency of occurrence of the alarm to approximate its probability; $fault_j \in alarm_i$ denotes that $fault_j$ is one of the $c_i$ causes of $alarm_i$; and $P_{alarm_i, fault_j}$ denotes the probability of $fault_j$ for $alarm_i$, calculated in the same way as the probability of occurrence of the fault causes corresponding to each type of alarm described above.
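The probability of a fault cause given the alarm sequence is a sum, over every alarm that lists the cause, of the alarm's probability multiplied by the cause's probability for that alarm. A minimal sketch follows; the dictionary layout and the sample values are hypothetical.

```python
from fractions import Fraction


def fault_probabilities(alarm_probs, cause_probs):
    """Sum P(alarm_i) * P(alarm_i, fault_j) over all alarms that list fault_j.

    alarm_probs: {alarm: probability of the alarm in the time window}
    cause_probs: {alarm: {fault: probability of that fault for the alarm}}
    """
    result = {}
    for alarm, p_alarm in alarm_probs.items():
        for fault, p_fault in cause_probs.get(alarm, {}).items():
            result[fault] = result.get(fault, 0) + p_alarm * p_fault
    return result


# Hypothetical window: two alarm types, equally frequent.
alarm_probs = {"A1": Fraction(1, 2), "A2": Fraction(1, 2)}
cause_probs = {
    "A1": {"f1": Fraction(1, 2), "f2": Fraction(1, 2)},
    "A2": {"f2": Fraction(1, 1)},
}
p_fault = fault_probabilities(alarm_probs, cause_probs)
```

Here f2 is a cause of both alarms, so it accumulates weight from each and ends up more probable than f1.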
[128] Step 105: Select the fault causes whose probability of occurrence falls within a preset range as the fault causes of the alarm set.
[129] From among all the fault causes, the server selects those whose probability of occurrence falls within a preset range (for example, the fault cause with the highest probability, the three fault causes with the highest probabilities, or the fault causes within a certain interval) as the fault causes of the alarm set or alarm sequence. The preset range is set dynamically as needed; for example, it may be at least one of the ten fault causes with the highest probabilities, but is not limited to this.
[130] As another example, the N fault causes with the highest probabilities (for example, the top 1, top 5, top 8, or top 10; the value of N can be chosen as needed) are selected as the fault cause set (or fault cause sequence) of the alarm set within the time window.
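Selecting the N most probable fault causes is a sort followed by a slice. A minimal sketch, with a hypothetical function name and sample probabilities:

```python
def select_top_causes(fault_probs, n=1):
    """Return the n fault causes with the highest probability, most probable first."""
    ranked = sorted(fault_probs.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# Hypothetical fault cause probabilities for one time window.
fault_probs = {"f1": 0.25, "f2": 0.6, "f3": 0.15}
top_two = select_top_causes(fault_probs, n=2)
```

An interval-based preset range could be handled the same way by filtering on the probability value instead of slicing.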
[131] Step 106: Perform hierarchical merging and classification on the fault causes of the alarm set to implement hierarchical locating of the fault cause.
[132] In this embodiment, the fault causes are merged and classified to obtain the coarsest category of fault causes, the fault cause at that level is calculated, and the fault cause is then calculated level by level until the fault location target is reached.
[133] A schematic diagram of hierarchical fault locating is shown in FIG. 5. Given the alarm set (or alarm sequence) $alarm_1, alarm_2, \ldots, alarm_n$ within a known time window, the merged and classified fault causes are calculated and the fault cause at the merged level is determined. The same alarm set or alarm sequence is then used to calculate which of the causes that were merged into that fault cause is responsible. This continues until the fault location target is reached; if the fault location target is the original cause, the calculation continues until the original cause is located.
[134] To make hierarchical fault locating easier to understand, the process is illustrated below with an example.
[135] The initial fault causes are extracted for the alarm set or alarm sequence and are merged and classified to obtain merged cause (1); merging and classification then continues to obtain merged cause (2). The corresponding schematic diagram of merging and classifying fault causes is shown in FIG. 6. Then, following the method described above for calculating the probability of occurrence of the fault causes corresponding to each type of alarm, the merged cause (2) level is used first: the calculation shows that, among the fault causes of the alarm set or alarm sequence, "monitoring device" has the highest probability, so the fault cause at this level is determined to be "monitoring device". "Monitoring device" was merged from three causes: "monitoring device configuration", "monitoring device", and "monitoring device cable". Using the same alarm set or alarm sequence, the fault causes of the alarms are calculated at that level, and "monitoring device configuration" is found to have the highest probability. (The probability of an alarm fault cause after merging and classification is the probability at the level concerned. For example, an alarm may have three fault causes, each with probability 1/3; after they are merged into a single cause, that cause has probability 1.) If the initial cause needs to be located, the alarm sequence is used to continue the calculation at the next level in the same way, which determines that "monitoring device not firmly plugged in" is the alarm fault cause.
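The level-by-level locating in this example can be sketched as a descent from the coarsest merge level to the initial causes. This is a simplified illustration under assumptions: the probability of a merged cause is taken as the sum of the probabilities of the initial causes merged into it, and all cause names and values are hypothetical rather than taken from FIG. 6.

```python
def locate_by_level(initial_probs, hierarchy):
    """Pick the most probable cause at each level, from coarsest to finest.

    initial_probs: {initial cause: probability}
    hierarchy: list of dicts ordered coarsest first; each maps an initial
               cause to its merged cause at that level.
    ASSUMPTION: a merged cause's probability is the sum over its members.
    """
    candidates = set(initial_probs)
    path = []
    for level_map in hierarchy:
        totals = {}
        for cause in candidates:
            merged = level_map[cause]
            totals[merged] = totals.get(merged, 0) + initial_probs[cause]
        best = max(totals, key=totals.get)
        path.append(best)
        # Only causes merged into the chosen one are examined at the next level.
        candidates = {c for c in candidates if level_map[c] == best}
    path.append(max(candidates, key=lambda c: initial_probs[c]))
    return path


initial_probs = {
    "cable not firmly plugged in": 0.4,
    "cable damaged": 0.1,
    "wrong configuration": 0.2,
    "board hardware failure": 0.3,
}
merged_2 = {
    "cable not firmly plugged in": "monitoring device",
    "cable damaged": "monitoring device",
    "wrong configuration": "monitoring device",
    "board hardware failure": "board",
}
merged_1 = {
    "cable not firmly plugged in": "monitoring device cable",
    "cable damaged": "monitoring device cable",
    "wrong configuration": "monitoring device configuration",
    "board hardware failure": "board hardware",
}
path = locate_by_level(initial_probs, [merged_2, merged_1])
```

Stopping the loop early corresponds to a coarser fault location target, for example locating only the device rather than the initial cause.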
[136] In the embodiments of the present invention, alarm information is first extracted from the alarm design and description documents. The alarm fault causes are then merged and classified according to the fault location target, and the probability of occurrence of the fault causes of the alarm set (or alarm sequence) in the live network within a time window is calculated. This implements hierarchical locating of the fault cause and improves fault location efficiency. Further, merging and classifying the alarm fault causes improves the correlation of the alarms, and updating the fault cause probabilities of the alarms in real time improves the accuracy of locating the fault causes of the alarms.
[137] Based on the implementation of the foregoing method, an embodiment of the present invention further provides a fault location apparatus, whose schematic structural diagram is shown in FIG. 7. The apparatus includes an extracting unit 71, a first establishing unit 72, a processing unit 73, an obtaining unit 74, a second establishing unit 75, a determining unit 76, a selecting unit 77, and a locating unit 78.
[138] The extracting unit 71 is configured to extract the alarm information of each device in the network, specifically from the alarm design and description documents of each device in the network.
[139] The first establishing unit 72 is configured to establish the correspondence between the alarm names and the alarm fault causes in the alarm information, and is specifically configured to build the bipartite graph of the alarm set and the alarm fault cause set. The bipartite graph is defined above and is not described again here.
[140] The processing unit 73 is configured to merge and classify the alarm fault causes according to the fault location target to obtain the alarm fault cause set.
[141] Optionally, the processing unit includes a setting unit and a classification unit. The setting unit is configured to set the fault location target according to the device fault location principle. The classification unit is configured to merge and classify the alarm fault causes according to the fault location target to obtain the alarm fault cause set.
[142]所述获取单元 74, 用于获取现网中告警数据的告警集合; [142] The obtaining unit 74 is configured to acquire an alarm set of alarm data in the live network.
[143]所述第二建立单元 75, 用于建立所述告警集合和所述告警故障原因集合的对应 关系; 其中, 所述告警集包括: 告警名称集合; [143] The second establishing unit 75 is configured to establish a correspondence between the alarm set and the alarm failure cause set, where the alarm set includes: a set of alarm names;
[144]所述确定单元 76, 用于确定预设时间窗内所述告警集合对应的故障原因发生的 概率; The determining unit 76 is configured to determine a probability of occurrence of a fault cause corresponding to the alarm set in a preset time window;
[145]可选的, 所述确定单元包括: 第一概率计算单元, 第二概率计算单元和概率确 定单元,所述第一概率计算单元,用于计算预设时间窗内告警集合中每种告警发生的 概率, 具体用于使用预设时间窗内的告警发生的频次作为时间窗内告警发生的概率; 所述第二概率计算单元,用于计算所述每种告警对应的故障原因发生的概率; 所述概 率确定单元, 用于根据所述每种告警发生的概率和对应的每种故障原因发生的概率, 确定所述预设时间窗内所述告警集合对应的故障原因发生的概率。 [145] Optionally, the determining unit includes: a first probability calculating unit, a second probability calculating unit, and a probability determining unit, where the first probability calculating unit is configured to calculate each of the preset time window alarm sets The probability of occurrence of the alarm is specifically used to use the frequency of occurrence of the alarm in the preset time window as the probability of occurrence of the alarm in the time window; the second probability calculation unit is configured to calculate the occurrence of the fault corresponding to each alarm The probability determining unit is configured to determine, according to the probability of occurrence of each type of alarm and the probability of occurrence of each fault cause, the probability of occurrence of the fault cause corresponding to the alarm set in the preset time window.
[146] Optionally, the second probability calculating unit includes an initialization unit and a verification unit. The initialization unit is configured to initialize the probability of occurrence of the fault cause of each type of alarm; the verification unit is configured to calculate and verify, within the time window, the probability of occurrence of the fault cause of each type of alarm. The second probability calculating unit may further include an update unit, configured to update the probability of occurrence of the fault cause corresponding to each type of alarm.
[147] Optionally, the classification unit is further configured to obtain an alarm name set after merging and classifying the alarm fault causes according to the fault location target. The apparatus further includes a calculating unit, configured to calculate the correlation degree of the alarm name set obtained after the merging and classification, where the correlation degree is used to verify the validity of the merged classification.
[148] The selecting unit 77 is configured to select the probability of occurrence of the fault cause that satisfies a preset range as the probability of occurrence of the fault cause corresponding to the alarm set.
[149] The locating unit 78 is configured to hierarchically merge and classify the fault causes of the alarm set, to achieve hierarchical location of the fault cause.
[150] Optionally, the locating unit includes a hierarchical classification unit and a hierarchical locating unit. The hierarchical classification unit is configured to hierarchically merge and classify the fault causes of the alarm set, to obtain the merged causes of each layer; the hierarchical locating unit is configured to calculate the fault cause level by level toward the upper layers, to complete hierarchical location of the fault location target.
[151] Optionally, the apparatus may be integrated in a terminal or deployed independently; this embodiment imposes no limitation.
[152] For the implementation of the functions and roles of the units in the apparatus, see the implementation of the corresponding steps in the foregoing method; details are not repeated here.
[153] Correspondingly, an embodiment of the present invention further provides a fault location apparatus, whose schematic structural diagram is shown in FIG. 8. The apparatus includes an alarm information extraction unit 81, an alarm information processing unit 82, an alarm data processing unit 83, and a fault hierarchy locating unit 84, where:
[154] The alarm information extraction unit 81 is configured to extract alarm information of each device in the network, and to establish a correspondence between alarm names and alarm fault causes in the alarm information.
[155] The alarm information processing unit 82 is configured to merge and classify the alarm fault causes according to a fault location target, to obtain an alarm fault cause set.
[156] The alarm data processing unit 83 is configured to: obtain an alarm set from alarm data, and establish a correspondence between the alarm set and the alarm fault cause set, where the alarm set includes an alarm name set; determine the probability of occurrence of the fault cause corresponding to the alarm set within a preset time window; and select, as the fault cause of the alarm set, the fault cause whose probability of occurrence satisfies a preset range.
[157] The fault hierarchy locating unit 84 is configured to hierarchically merge and classify the fault causes of the alarm set, to achieve hierarchical location of the fault cause.
[158] Optionally, the alarm information processing unit 82 is specifically configured to set a fault location target according to a device fault location principle, and to merge and classify the alarm fault causes according to the fault location target, to obtain an alarm fault cause set.
[159] Optionally, when the alarm information processing unit 82 merges and classifies the alarm fault causes according to the fault location target and obtains the merged and classified alarm name set, it is further configured to calculate the correlation degree of the merged and classified alarm name set, where the correlation degree is used to verify the validity of the merged classification.
[160] Optionally, the alarm data processing unit 83 establishing the correspondence between the alarm set and the alarm fault cause set includes: establishing, according to the correspondence between alarm names and alarm fault causes, a bipartite graph of the alarm set and the alarm fault cause set.
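The bipartite graph between the alarm set and the alarm fault cause set can be pictured with a small Python sketch. The adjacency-dictionary representation, the function name `build_bipartite`, and the sample alarm names and causes are hypothetical and chosen only for illustration; the text does not prescribe a data structure.

```python
def build_bipartite(name_to_causes, merged_cause_of, alarm_set):
    """Build alarm -> {merged fault causes} edges of a bipartite graph.

    `name_to_causes` is the alarm-name-to-cause correspondence extracted
    from device documentation; `merged_cause_of` maps each raw cause to its
    merged class (a cause with no entry is kept unchanged).
    """
    edges = {}
    for name in alarm_set:
        raw_causes = name_to_causes.get(name, ())
        edges[name] = {merged_cause_of.get(cause, cause) for cause in raw_causes}
    return edges


# Hypothetical correspondence tables.
name_to_causes = {
    "LINK_DOWN": ["fiber cut", "port failure"],
    "HIGH_TEMP": ["fan failure"],
}
merged_cause_of = {
    "fiber cut": "link",
    "port failure": "board",
    "fan failure": "board",
}
graph = build_bipartite(name_to_causes, merged_cause_of, {"LINK_DOWN", "HIGH_TEMP"})
```

One side of the graph holds alarm names observed in the network and the other holds merged fault causes, so an alarm name that maps to several causes simply has several outgoing edges.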
[161] Optionally, the alarm data processing unit 83 determining the probability of occurrence of the fault cause corresponding to the alarm set within the preset time window includes: calculating the probability of occurrence of each type of alarm in the alarm set within the time window; calculating the probability of occurrence of the fault cause corresponding to each type of alarm; determining, according to the probability of occurrence of each type of alarm and the probability of occurrence of each corresponding fault cause, the probability of occurrence of the fault cause corresponding to the alarm set within the time window; and selecting, as the fault cause of the alarm set, the fault cause whose probability of occurrence satisfies the preset range.
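The text does not give a formula for combining the per-alarm probability with the per-cause probability. One plausible reading, shown here purely as an assumption, is the law of total probability, P(cause) = sum over alarms of P(alarm) * P(cause | alarm). The function name `cause_probabilities` and the sample values are hypothetical.

```python
def cause_probabilities(alarm_prob, cause_given_alarm):
    """Combine P(alarm) and P(cause | alarm) into a score per fault cause.

    Assumption: the combination is a weighted sum over alarms
    (law of total probability). The source does not state the formula.
    """
    scores = {}
    for alarm, p_alarm in alarm_prob.items():
        for cause, p_cause in cause_given_alarm.get(alarm, {}).items():
            scores[cause] = scores.get(cause, 0.0) + p_alarm * p_cause
    return scores


# Hypothetical inputs for one time window.
alarm_prob = {"LINK_DOWN": 0.75, "HIGH_TEMP": 0.25}
cause_given_alarm = {
    "LINK_DOWN": {"link": 0.5, "board": 0.5},
    "HIGH_TEMP": {"board": 1.0},
}
cause_prob = cause_probabilities(alarm_prob, cause_given_alarm)
```

Under these assumed inputs the cause `board` scores 0.625 and `link` scores 0.375, because `board` is supported by both alarm types.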
[162] Optionally, the alarm data processing unit 83 calculating the probability of occurrence of each type of alarm in the alarm set within the predetermined time window includes: using the frequency of occurrence of an alarm within the predetermined time window as the probability of occurrence of the alarm within the predetermined time window.
[163] Optionally, the alarm data processing unit 83 calculating the probability of occurrence of the fault cause corresponding to each type of alarm includes: initializing the probability of occurrence of the fault cause of each type of alarm; and calculating and verifying, within the predetermined time window, the probability of occurrence of the fault cause of each type of alarm.
[164] Optionally, the alarm data processing unit 83 calculating the probability of occurrence of the fault cause corresponding to each type of alarm further includes: updating the probability of occurrence of the fault cause corresponding to each type of alarm.
[165] Optionally, the alarm data processing unit 83 determining the probability of occurrence of the fault cause corresponding to the alarm set within the time window includes: selecting the fault cause with the maximum probability of occurrence, or selecting the fault cause whose probability of occurrence satisfies the preset range, as the fault cause of the alarm set.
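The two selection rules mentioned above (maximum probability, or probability within a preset range) can be sketched in Python as follows. The function name `select_causes` and the `low`/`high` parameters are assumptions for this example; the text does not define how the preset range is expressed.

```python
def select_causes(cause_prob, low=None, high=1.0):
    """Pick fault causes by maximum probability or by a preset range.

    If `low` is None, return the cause(s) with the maximum probability;
    otherwise return every cause whose probability lies in [low, high].
    Results are sorted by name so the output is deterministic.
    """
    if not cause_prob:
        return []
    if low is None:
        best = max(cause_prob.values())
        return sorted(c for c, p in cause_prob.items() if p == best)
    return sorted(c for c, p in cause_prob.items() if low <= p <= high)


# Hypothetical per-cause probabilities for one time window.
cause_prob = {"link": 0.375, "board": 0.625}
top = select_causes(cause_prob)
in_range = select_causes(cause_prob, low=0.3)
```

Here `top` is `["board"]`, while `in_range` contains both causes because each has a probability of at least 0.3.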
[166] Optionally, the fault hierarchy locating unit 84 is specifically configured to hierarchically merge and classify the fault causes of the alarm set, to obtain the merged causes of each layer, and to calculate the fault cause level by level toward the upper layers, to complete hierarchical location of the fault location target.
[167] Optionally, the apparatus may be integrated in a terminal or deployed independently; this embodiment imposes no limitation.
[168] For the implementation of the functions and roles of the units in the apparatus, see the implementation of the corresponding steps in the foregoing method; details are not repeated here.
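The level-by-level calculation in the hierarchical location described in [166] can be sketched in Python under explicit assumptions: each cause has at most one parent class (the hypothetical `parent_of` map), the probability of a parent is the sum of its children's probabilities, and the most probable entry is reported at each level. None of these rules is stated in the text.

```python
def roll_up(level_prob, parent_of):
    """Aggregate probabilities one level up by summing children per parent."""
    upper = {}
    for cause, p in level_prob.items():
        parent = parent_of.get(cause)
        if parent is not None:
            upper[parent] = upper.get(parent, 0.0) + p
    return upper


def locate_by_level(cause_prob, parent_of):
    """Return the most probable cause at each level, lowest level first.

    Assumes `parent_of` is acyclic, so the loop ends at the top level.
    """
    path = []
    level = dict(cause_prob)
    while level:
        path.append(max(level, key=level.get))
        level = roll_up(level, parent_of)
    return path


# Hypothetical cause hierarchy and leaf probabilities.
parent_of = {
    "port 1 failure": "line card",
    "port 2 failure": "line card",
    "fan failure": "chassis",
    "line card": "device",
    "chassis": "device",
}
leaf_prob = {"port 1 failure": 0.5, "port 2 failure": 0.25, "fan failure": 0.25}
path = locate_by_level(leaf_prob, parent_of)
```

For this sample hierarchy the result is `["port 1 failure", "line card", "device"]`, that is, one located cause per level from the most specific to the most general.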
[169] Correspondingly, an embodiment of the present invention further provides a server, whose schematic structural diagram is shown in FIG. 9. The server 9 includes a memory 91, a transceiver 92, and a processor 93, where:
[170] The memory 91 is configured to store the alarm design and description documents of each device in the network.
[171] The transceiver 92 is configured to obtain the alarm design and description documents of each device in the network that are stored in the memory 91, and to extract alarm information from the alarm design and description documents.
[172] The processor 93 is configured to establish a correspondence between alarm names and alarm fault causes in the alarm information, and to merge and classify the alarm fault causes according to a fault location target, to obtain an alarm fault cause set.
[173] The transceiver 92 is further configured to obtain an alarm set from alarm data in the live network.
[174] The processor 93 is further configured to: establish a correspondence between the alarm set and the alarm fault cause set, where the alarm set includes an alarm name set; determine the probability of occurrence of the fault cause corresponding to the alarm set within a preset time window; select, as the fault cause of the alarm set, the fault cause whose probability of occurrence satisfies a preset range; and hierarchically merge and classify the fault causes of the alarm set, to achieve hierarchical location of the fault cause.
[175] Optionally, the processor merging and classifying the alarm fault causes according to the fault location target to obtain the alarm fault cause set includes: setting a fault location target according to a device fault location principle; and merging and classifying the alarm fault causes according to the fault location target, to obtain the alarm fault cause set.
[176] Optionally, when the processor merges and classifies the alarm fault causes according to the fault location target, it further obtains the merged and classified alarm name set; the processor is further configured to calculate the correlation degree of the merged and classified alarm name set, where the correlation degree is used to verify the validity of the merged classification.
[177] Optionally, the processor establishing the correspondence between the alarm set and the alarm fault cause set includes: establishing, according to the correspondence between alarm names and alarm fault causes, a bipartite graph of the alarm set and the alarm fault cause set.
[178] Optionally, the processor determining the probability of occurrence of the fault cause corresponding to the alarm set within the preset time window includes: calculating the probability of occurrence of each type of alarm in the alarm set within the preset time window; calculating the probability of occurrence of the fault cause corresponding to each type of alarm; determining, according to the probability of occurrence of each type of alarm and the probability of occurrence of each corresponding fault cause, the probability of occurrence of the fault cause corresponding to the alarm set within the preset time window; and selecting, as the fault cause of the alarm set, the fault cause whose probability of occurrence satisfies the preset range.
[179] Optionally, the processor calculating the probability of occurrence of each type of alarm in the alarm set within the predetermined time window includes: using the frequency of occurrence of an alarm within the predetermined time window as the probability of occurrence of the alarm within the predetermined time window.
[180] Optionally, the processor calculating the probability of occurrence of the fault cause corresponding to each type of alarm includes: initializing the probability of occurrence of the fault cause of each type of alarm; and calculating and verifying, within the predetermined time window, the probability of occurrence of the fault cause of each type of alarm.
[181] Optionally, the processor calculating the probability of occurrence of the fault cause corresponding to each type of alarm further includes: updating the probability of occurrence of the fault cause corresponding to each type of alarm.
[182] Optionally, the processor determining the probability of occurrence of the fault cause corresponding to the alarm set within the preset time window includes: selecting the fault cause with the maximum probability of occurrence, or selecting the fault cause whose probability of occurrence satisfies the preset range, as the fault cause of the alarm set.
[183] Optionally, the processor hierarchically merging and classifying the fault causes of the alarm set to achieve hierarchical location of the fault cause includes: hierarchically merging and classifying the fault causes of the alarm set, to obtain the merged causes of each layer; and calculating the fault cause level by level toward the upper layers, to complete hierarchical location of the fault location target.
[184] Correspondingly, an embodiment of the present invention further provides a terminal. The terminal includes a server, which is the server described above and includes a memory and a processor. For the functions and roles of the memory and the processor, see the foregoing description; details are not repeated here.
[185] In the embodiments of the present invention, the UE may be any of the following and may be stationary or mobile. A stationary UE may specifically be a terminal, a mobile station, a subscriber unit, or a station. A mobile UE may specifically be a cellular phone, a personal digital assistant (PDA), a modem, a wireless communication device, a handheld device, a laptop computer, a cordless phone, or a wireless local loop (WLL) station. The UEs may be distributed throughout the wireless network.
[186] It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any other variants are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
[187] From the description of the foregoing embodiments, a person skilled in the art can clearly understand that the present invention may be implemented by software plus a necessary general-purpose hardware platform, or by hardware alone, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, may be embodied in the form of a software product. The computer software product may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments, or in certain parts of the embodiments, of the present invention.
[188] The foregoing describes only preferred embodiments of the present invention. It should be noted that a person of ordinary skill in the art may make several improvements and refinements without departing from the principles of the present invention, and these improvements and refinements shall also fall within the protection scope of the present invention.

Claims

1. A fault location method, characterized by comprising:
extracting alarm information of each device in a network, and establishing a correspondence between alarm names and alarm fault causes in the alarm information;
merging and classifying the alarm fault causes according to a fault location target, to obtain an alarm fault cause set;
obtaining an alarm set from alarm data in a live network, and establishing a correspondence between the alarm set and the alarm fault cause set, wherein the alarm set comprises an alarm name set;
determining a probability of occurrence of the fault cause corresponding to the alarm set within a predetermined time window;
selecting, as the fault cause of the alarm set, the fault cause whose probability of occurrence satisfies a preset range; and
hierarchically merging and classifying the fault causes of the alarm set, to achieve hierarchical location of the fault cause.
2. The method according to claim 1, characterized in that the merging and classifying the alarm fault causes according to the fault location target, to obtain the alarm fault cause set, comprises:
setting the fault location target according to a device fault location principle; and
merging and classifying the alarm fault causes according to the fault location target, to obtain the alarm fault cause set.
3. The method according to claim 2, characterized in that the merging and classifying the alarm fault causes according to the fault location target further yields an alarm name set, and the method further comprises:
calculating a correlation degree of the alarm name set, wherein the correlation degree is used to verify the validity of the merged classification.
4. The method according to claim 1, characterized in that the establishing a correspondence between the alarm set and the alarm fault cause set comprises:
establishing, according to the correspondence between the alarm names and the alarm fault causes, a bipartite graph of the alarm set and the alarm fault cause set.
5. The method according to claim 1, characterized in that the determining a probability of occurrence of the fault cause corresponding to the alarm set within a predetermined time window comprises:
calculating a probability of occurrence of each type of alarm in the alarm set within the predetermined time window;
calculating a probability of occurrence of the fault cause corresponding to each type of alarm; and
determining, according to the probability of occurrence of each type of alarm and the probability of occurrence of each corresponding fault cause, the probability of occurrence of the fault cause corresponding to the alarm set within the predetermined time window.
6. The method according to claim 5, characterized in that the calculating a probability of occurrence of each type of alarm in the alarm set within the predetermined time window comprises:
using the frequency of occurrence of an alarm within the predetermined time window as the probability of occurrence of the alarm within the preset time window.
7. The method according to claim 5, characterized in that the calculating a probability of occurrence of the fault cause corresponding to each type of alarm comprises:
initializing the probability of occurrence of the fault cause of each type of alarm;
calculating and verifying, within the predetermined time window, the probability of occurrence of the fault cause of each type of alarm; and
updating the probability of occurrence of the fault cause corresponding to each type of alarm.
8. The method according to any one of claims 1 to 7, characterized in that the hierarchically merging and classifying the fault causes of the alarm set, to achieve hierarchical location of the fault cause, comprises:
hierarchically merging and classifying the fault causes of the alarm set, to obtain the merged causes of each layer; and
calculating the fault cause level by level toward the upper layers, to complete hierarchical location of the fault location target.
9. A fault location apparatus, characterized by comprising:
an extraction unit, configured to extract alarm information of each device in a network;
a first establishing unit, configured to establish a correspondence between alarm names and alarm fault causes in the alarm information;
a processing unit, configured to merge and classify the alarm fault causes according to a fault location target, to obtain an alarm fault cause set;
an obtaining unit, configured to obtain an alarm set from alarm data in a live network;
a second establishing unit, configured to establish a correspondence between the alarm set and the alarm fault cause set, wherein the alarm set comprises an alarm name set;
a determining unit, configured to determine a probability of occurrence of the fault cause corresponding to the alarm set within a predetermined time window;
a selecting unit, configured to select, as the fault cause of the alarm set, the fault cause whose probability of occurrence satisfies a preset range; and
a locating unit, configured to hierarchically merge and classify the fault causes of the alarm set, to achieve hierarchical location of the fault cause.
10. The apparatus according to claim 9, characterized in that the processing unit comprises:
a setting unit, configured to set the fault location target according to a device fault location principle; and
a classification unit, configured to merge and classify the alarm fault causes according to the fault location target, to obtain the alarm fault cause set.
11. The apparatus according to claim 10, characterized in that
the classification unit is further configured to obtain an alarm name set after merging and classifying the alarm fault causes according to the fault location target; and the apparatus further comprises:
a calculating unit, configured to calculate a correlation degree of the merged and classified alarm name set, wherein the correlation degree is used to verify the validity of the merged classification.
12. The apparatus according to claim 9, characterized in that the first establishing unit is specifically configured to establish, according to the correspondence between the alarm names and the alarm fault causes, a bipartite graph of the alarm set and the alarm fault cause set.
13. The apparatus according to claim 9, characterized in that the determining unit comprises:
a first probability calculating unit, configured to calculate a probability of occurrence of each type of alarm in the alarm set within the predetermined time window;
a second probability calculating unit, configured to calculate a probability of occurrence of the fault cause corresponding to each type of alarm; and
a probability determining unit, configured to determine, according to the probability of occurrence of each type of alarm and the probability of occurrence of each corresponding fault cause, the probability of occurrence of the fault cause corresponding to the alarm set within the predetermined time window.
14. The apparatus according to claim 13, characterized in that
the first probability calculating unit is specifically configured to use the frequency of occurrence of an alarm within the predetermined time window as the probability of occurrence of the alarm within the preset time window.
15. The apparatus according to claim 13, characterized in that the second probability calculating unit comprises:
an initialization unit, configured to initialize the probability of occurrence of the fault cause of each type of alarm;
a verification unit, configured to calculate and verify, within the predetermined time window, the probability of occurrence of the fault cause of each type of alarm; and
an update unit, configured to update the probability of occurrence of the fault cause corresponding to each type of alarm.
16. The apparatus according to any one of claims 9 to 15, characterized in that the locating unit comprises:
a hierarchical classification unit, configured to hierarchically merge and classify the fault causes of the alarm set, to obtain the merged causes of each layer; and
a hierarchical locating unit, configured to calculate the fault cause level by level toward the upper layers, to complete hierarchical location of the fault location target.
17. A fault location apparatus, characterized by comprising:
an alarm information extraction unit, configured to extract alarm information of each device in a network, and to establish a correspondence between alarm names and alarm fault causes in the alarm information;
an alarm information processing unit, configured to merge and classify the alarm fault causes according to a fault location target, to obtain an alarm fault cause set;
an alarm data processing unit, configured to: obtain an alarm set from alarm data, and establish a correspondence between the alarm set and the alarm fault cause set, wherein the alarm set comprises an alarm name set; determine a probability of occurrence of the fault cause corresponding to the alarm set within a predetermined time window; and select, as the fault cause of the alarm set, the fault cause whose probability of occurrence satisfies a preset range; and
a fault hierarchy locating unit, configured to hierarchically merge and classify the fault causes of the alarm set, to achieve hierarchical location of the fault cause.
18. The apparatus according to claim 17, characterized in that the alarm information processing unit is specifically configured to set the fault location target according to a device fault location principle, and to merge and classify the alarm fault causes according to the fault location target, to obtain the alarm fault cause set.
19. The device according to claim 18, characterized in that, when merging and classifying the alarm fault causes according to the fault location target to obtain the merged and classified alarm name set, the alarm information processing unit is further configured to compute a correlation degree of the merged and classified alarm name set, the correlation degree being used to verify the validity of the merging and classification.
20. The device according to claim 17, characterized in that the alarm data processing unit establishing the correspondence between the alarm set and the alarm fault cause set comprises: building a bipartite graph of the alarm set and the alarm fault cause set according to the correspondence between the alarm names and the alarm fault causes.
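The bipartite graph of claim 20 can be pictured as two adjacency maps, one from alarms to candidate causes and one from causes back to alarms. The following is a minimal sketch only; the function name `build_bipartite_graph`, the alarm names, and the cause names are hypothetical and do not come from the application.

```python
from collections import defaultdict


def build_bipartite_graph(name_to_causes, alarm_names):
    """Return (alarm -> causes, cause -> alarms) maps for the observed alarms.

    name_to_causes: the alarm-name / fault-cause correspondence.
    alarm_names:    the alarm name set observed in the alarm data.
    """
    alarm_to_causes = defaultdict(set)
    cause_to_alarms = defaultdict(set)
    for alarm in alarm_names:
        # Alarms with no known cause simply contribute no edges.
        for cause in name_to_causes.get(alarm, ()):
            alarm_to_causes[alarm].add(cause)
            cause_to_alarms[cause].add(alarm)
    return dict(alarm_to_causes), dict(cause_to_alarms)


# Hypothetical correspondence table, for illustration only.
name_to_causes = {
    "LINK_DOWN": {"fiber_cut", "port_failure"},
    "LOS": {"fiber_cut"},
    "HIGH_TEMP": {"fan_failure"},
}
alarm_to_causes, cause_to_alarms = build_bipartite_graph(
    name_to_causes, ["LINK_DOWN", "LOS"]
)
```

Storing both directions keeps later lookups cheap: the alarm side feeds the probability calculation, while the cause side shows which alarms support each candidate cause.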
21. The device according to claim 17, characterized in that the alarm data processing unit determining the probability of occurrence of the fault causes corresponding to the alarm set within the preset time window comprises: computing the probability of occurrence of each type of alarm in the alarm set within the preset time window; computing the probability of occurrence of the fault cause corresponding to each type of alarm; determining, according to the probability of occurrence of each type of alarm and the probability of occurrence of each corresponding fault cause, the probability of occurrence of the fault causes corresponding to the alarm set within the preset time window; and selecting the fault cause whose probability of occurrence satisfies the preset range as the fault cause of the alarm set.
22. The device according to claim 21, characterized in that the alarm data processing unit computing the probability of occurrence of each type of alarm in the alarm set within the preset time window comprises: using the frequency of occurrence of an alarm within the predetermined time window as the probability of occurrence of that alarm within the predetermined time window.
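The steps of claims 21 and 22 can be sketched as follows. The claims do not state how the alarm probability and the per-alarm cause probability are combined, so this sketch assumes a total-probability style sum, score(c) = Σ P(a) · P(c | a), and assumes that "frequency" means the relative frequency within the window. All function names, alarm names, and numeric values are hypothetical.

```python
from collections import Counter


def alarm_probabilities(events, window_start, window_end):
    """Relative frequency of each alarm name in [window_start, window_end).

    events: iterable of (timestamp, alarm_name) pairs.
    """
    names = [name for ts, name in events if window_start <= ts < window_end]
    counts = Counter(names)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()} if total else {}


def cause_scores(alarm_prob, cause_given_alarm):
    """Assumed combination rule: score(c) = sum over alarms a of P(a) * P(c | a)."""
    scores = {}
    for alarm, p_alarm in alarm_prob.items():
        for cause, p_cause in cause_given_alarm.get(alarm, {}).items():
            scores[cause] = scores.get(cause, 0.0) + p_alarm * p_cause
    return scores


def select_causes(scores, low, high=1.0):
    """Keep the causes whose score lies inside the preset range [low, high]."""
    return {cause for cause, s in scores.items() if low <= s <= high}


# Hypothetical alarm events and per-alarm cause probabilities.
events = [(1, "LINK_DOWN"), (2, "LOS"), (3, "LINK_DOWN"), (4, "LOS"), (50, "HIGH_TEMP")]
cause_given_alarm = {
    "LINK_DOWN": {"fiber_cut": 0.5, "port_failure": 0.5},
    "LOS": {"fiber_cut": 1.0},
}
probs = alarm_probabilities(events, 0, 10)
scores = cause_scores(probs, cause_given_alarm)
likely = select_causes(scores, low=0.5)
```

With these inputs the `HIGH_TEMP` event falls outside the window and is ignored, and only `fiber_cut` clears the lower bound of 0.5. A different combination rule or range would change which causes are kept.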
23. The device according to claim 21, characterized in that the alarm data processing unit computing the probability of occurrence of the fault cause corresponding to each type of alarm comprises: initializing the probability of occurrence of the fault cause of each type of alarm; computing and verifying, within the predetermined time window, the probability of occurrence of the fault cause of each type of alarm; and updating the probability of occurrence of the fault cause corresponding to each type of alarm.
24. The device according to any one of claims 17 to 23, characterized in that the fault hierarchy locating unit is specifically configured to: perform hierarchical merging and classification on the fault causes of the alarm set to obtain the merged causes of each layer; and compute the fault cause level by level toward the upper layers, so as to complete hierarchical locating of the fault location target.
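The level-by-level computation of claim 24 can be illustrated with a simple roll-up over a cause hierarchy. This is a sketch under stated assumptions: the claims do not say how merged causes are scored, so child scores are simply summed into their parent, and the hierarchy (`parent_of`), the function name `roll_up`, and all node names are hypothetical.

```python
def roll_up(scores, parent_of):
    """Aggregate cause scores level by level toward the top of the hierarchy.

    scores:    {cause: score} at the lowest level.
    parent_of: {node: parent}; assumed acyclic. Nodes without an entry
               are treated as top-level and carried over unchanged.
    Returns one {node: score} dict per level, lowest level first.
    """
    levels = [dict(scores)]
    current = dict(scores)
    # Keep merging while at least one node still has a parent to merge into.
    while any(node in parent_of for node in current):
        merged = {}
        for node, score in current.items():
            parent = parent_of.get(node, node)
            merged[parent] = merged.get(parent, 0.0) + score  # assumed rule: sum
        levels.append(merged)
        current = merged
    return levels


# Hypothetical hierarchy: cause -> board -> device.
parent_of = {
    "fiber_cut": "line_card_1",
    "port_failure": "line_card_1",
    "line_card_1": "chassis_A",
}
levels = roll_up({"fiber_cut": 0.75, "port_failure": 0.25}, parent_of)
```

Each entry of `levels` corresponds to one layer of the location target, so an operator could stop at whichever layer (cause, board, or device) matches the chosen fault location target.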
PCT/CN2014/076867 2013-10-08 2014-05-06 Fault location method and device WO2015051638A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201310467700.7A CN104518905A (en) 2013-10-08 2013-10-08 Fault locating method and fault locating device
CN201310467700.7 2013-10-08

Publications (1)

Publication Number Publication Date
WO2015051638A1 true WO2015051638A1 (en) 2015-04-16

Family

ID=52793677

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/076867 WO2015051638A1 (en) 2013-10-08 2014-05-06 Fault location method and device

Country Status (2)

Country Link
CN (1) CN104518905A (en)
WO (1) WO2015051638A1 (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108521346A (en) * 2018-04-07 2018-09-11 中南大学 Method for positioning abnormal nodes of telecommunication bearer network based on terminal data
CN109474483A (en) * 2019-01-08 2019-03-15 Oppo广东移动通信有限公司 A kind of detection method, detection device and the terminal device of unit exception situation
CN110135603A (en) * 2019-05-21 2019-08-16 国网河南省电力公司信息通信公司 It is a kind of to alert space characteristics analysis method based on the electric power networks for improving entropy assessment
CN110309009A (en) * 2019-05-21 2019-10-08 北京云集智造科技有限公司 Situation-based operation and maintenance fault root cause positioning method, device, equipment and medium
CN111061616A (en) * 2019-11-25 2020-04-24 京信通信系统(中国)有限公司 Alarm management method, device, communication equipment and storage medium
CN111431754A (en) * 2020-04-13 2020-07-17 广东电网有限责任公司东莞供电局 Fault analysis method and system for power distribution and utilization communication network
CN111813587A (en) * 2020-05-28 2020-10-23 国网山东省电力公司 Software interface evaluation and fault early warning method and system
CN112003741A (en) * 2020-08-07 2020-11-27 北京浪潮数据技术有限公司 Alarm data processing method, device, equipment and readable storage medium
CN112699005A (en) * 2020-12-30 2021-04-23 网宿科技股份有限公司 Server hardware fault monitoring method, electronic equipment and storage medium
CN112770197A (en) * 2020-12-31 2021-05-07 深圳前海微众银行股份有限公司 Method, device, equipment and storage medium for determining OTN equipment fault reason
CN113691311A (en) * 2021-08-27 2021-11-23 中国科学院半导体研究所 Fault positioning method of optical network, electronic equipment and computer readable storage medium
CN114906077A (en) * 2022-06-08 2022-08-16 中国第一汽车股份有限公司 Fault processing method and device, storage medium, processor and electronic device
CN117560389A (en) * 2023-10-13 2024-02-13 陕西小保当矿业有限公司 Mine industrial Internet platform alarm fusion method and system

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107918629B (en) * 2016-10-11 2020-09-04 北京神州泰岳软件股份有限公司 Correlation method and device for alarm fault
WO2018119776A1 (en) * 2016-12-28 2018-07-05 深圳中兴力维技术有限公司 Alarm processing method and device
CN107018013B (en) * 2017-03-10 2020-06-23 京信通信系统(中国)有限公司 Alarm reporting method and equipment
CN106941423B (en) * 2017-04-13 2018-06-05 腾讯科技(深圳)有限公司 Failure cause localization method and device
CN110311932A (en) * 2018-03-20 2019-10-08 上海鋆锦信息科技有限公司 A kind of method and device thereof of private clound remote control gateway
JP7052602B2 (en) * 2018-07-02 2022-04-12 日本電信電話株式会社 Generator, generation method and generation program
CN109270910A (en) * 2018-10-31 2019-01-25 重庆长安汽车股份有限公司 Robot fault analysis method, apparatus and system on a kind of production line
CN109828857B (en) * 2018-12-29 2022-07-05 百度在线网络技术(北京)有限公司 Vehicle fault cause positioning method, device, equipment and storage medium
CN111669282B (en) * 2019-03-08 2023-10-24 华为技术有限公司 Method, device and computer storage medium for identifying suspected root cause alarm
CN112015160B (en) * 2019-05-31 2021-10-22 北京新能源汽车股份有限公司 Fault temperature determination method and device
CN111352808B (en) * 2020-03-03 2023-04-25 腾讯云计算(北京)有限责任公司 Alarm data processing method, device, equipment and storage medium
CN113825162B (en) * 2020-06-19 2024-05-28 中国移动通信集团设计院有限公司 Method and device for positioning fault reasons of telecommunication network
CN112039695A (en) * 2020-08-19 2020-12-04 朔黄铁路发展有限责任公司肃宁分公司 Transmission network fault positioning method and device based on Bayesian inference
CN112543126A (en) * 2020-12-22 2021-03-23 武汉联影医疗科技有限公司 Cloud platform monitoring method and device, computer equipment and storage medium
CN113420155A (en) * 2021-08-25 2021-09-21 深圳市信润富联数字科技有限公司 Wheel hub defect cause prediction method, electronic device, device and readable storage medium
CN114285732A (en) * 2021-12-23 2022-04-05 中国建设银行股份有限公司 Network fault positioning method, system, storage medium and electronic equipment
CN115865625A (en) * 2022-11-28 2023-03-28 武汉烽火技术服务有限公司 Method and device for analyzing fault root cause of communication equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1553328A (en) * 2003-06-08 2004-12-08 华为技术有限公司 Fault tree analysis based system fault positioning method and device
CN101360013A (en) * 2008-09-25 2009-02-04 烽火通信科技股份有限公司 General fast fault locating method for transmission network based on correlativity analysis
CN101917297A (en) * 2010-08-30 2010-12-15 烽火通信科技股份有限公司 Method and system for diagnosing faults of core network based on Bayesian network
CN102255764A (en) * 2011-09-02 2011-11-23 广东省电力调度中心 Method and device for diagnosing transmission network failure

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100586202C (en) * 2005-09-27 2010-01-27 华为技术有限公司 Fault positioning method and device
CN101997709B (en) * 2009-08-10 2014-03-12 中兴通讯股份有限公司南京分公司 Root alarm data analysis method and system
CN102291247A (en) * 2010-06-18 2011-12-21 中兴通讯股份有限公司 Alarm association diagram generation method and device and association alarm determination method and device

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108521346A (en) * 2018-04-07 2018-09-11 中南大学 Method for positioning abnormal nodes of telecommunication bearer network based on terminal data
CN108521346B (en) * 2018-04-07 2020-06-02 中南大学 Method for positioning abnormal nodes of telecommunication bearer network based on terminal data
CN109474483A (en) * 2019-01-08 2019-03-15 Oppo广东移动通信有限公司 A kind of detection method, detection device and the terminal device of unit exception situation
CN110309009B (en) * 2019-05-21 2022-05-13 北京云集智造科技有限公司 Situation-based operation and maintenance fault root cause positioning method, device, equipment and medium
CN110135603A (en) * 2019-05-21 2019-08-16 国网河南省电力公司信息通信公司 It is a kind of to alert space characteristics analysis method based on the electric power networks for improving entropy assessment
CN110309009A (en) * 2019-05-21 2019-10-08 北京云集智造科技有限公司 Situation-based operation and maintenance fault root cause positioning method, device, equipment and medium
CN111061616A (en) * 2019-11-25 2020-04-24 京信通信系统(中国)有限公司 Alarm management method, device, communication equipment and storage medium
CN111061616B (en) * 2019-11-25 2024-03-29 京信网络系统股份有限公司 Alarm management method, device, communication equipment and storage medium
CN111431754A (en) * 2020-04-13 2020-07-17 广东电网有限责任公司东莞供电局 Fault analysis method and system for power distribution and utilization communication network
CN111813587A (en) * 2020-05-28 2020-10-23 国网山东省电力公司 Software interface evaluation and fault early warning method and system
CN111813587B (en) * 2020-05-28 2024-04-26 国网山东省电力公司 Software interface evaluation and fault early warning method and system
CN112003741A (en) * 2020-08-07 2020-11-27 北京浪潮数据技术有限公司 Alarm data processing method, device, equipment and readable storage medium
CN112699005A (en) * 2020-12-30 2021-04-23 网宿科技股份有限公司 Server hardware fault monitoring method, electronic equipment and storage medium
CN112770197A (en) * 2020-12-31 2021-05-07 深圳前海微众银行股份有限公司 Method, device, equipment and storage medium for determining OTN equipment fault reason
CN113691311A (en) * 2021-08-27 2021-11-23 中国科学院半导体研究所 Fault positioning method of optical network, electronic equipment and computer readable storage medium
CN113691311B (en) * 2021-08-27 2022-12-06 中国科学院半导体研究所 Fault positioning method of optical network, electronic equipment and computer readable storage medium
CN114906077A (en) * 2022-06-08 2022-08-16 中国第一汽车股份有限公司 Fault processing method and device, storage medium, processor and electronic device
CN117560389A (en) * 2023-10-13 2024-02-13 陕西小保当矿业有限公司 Mine industrial Internet platform alarm fusion method and system

Also Published As

Publication number Publication date
CN104518905A (en) 2015-04-15

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 14852741; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 14852741; Country of ref document: EP; Kind code of ref document: A1)