WO2016173473A1 - Method and device for locating faults - Google Patents

Method and device for locating faults

Info

Publication number
WO2016173473A1
Authority
WO
WIPO (PCT)
Prior art keywords
network node
dependency
fault
alarm
value
Prior art date
Application number
PCT/CN2016/080096
Other languages
English (en)
French (fr)
Inventor
王烽
梁治平
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2016173473A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks

Definitions

  • The present invention relates to the field of computer technologies, and in particular to a method and device for locating faults.
  • In data center maintenance, a device issues an alarm when it fails. Devices associated with the faulty device also issue alarms. The resulting flood of alarms makes it harder for maintenance personnel to determine the root cause of the failure and lengthens the time needed to repair it.
  • A scheme that correlates alarms based on fault rules has been proposed.
  • The scheme presets fault rules and feeds all generated alarms into a fault rule engine, which uses the preset fault rules to locate the root cause of the fault.
  • However, because the current method determines the fault source from preset fault rules, it is relatively rigid, so fault location is relatively inefficient.
  • The embodiments of the present invention provide a method and a device for locating faults, which are used to solve the problem of low fault location efficiency.
  • A method of locating a fault, comprising:
  • each of the fault alarms includes an identifier of the first network node that issues the fault alarm and the alarm type of the fault alarm issued by the first network node, where the alarm type includes at least one of an application type, a link type, and a device type;
  • a dependency chain comprising the first network node and other network nodes having a dependency transfer relationship with the first network node, wherein the dependency chain characterizes the dependency transfer relationship from the first network node to each of the other network nodes, and the dependency transfer relationship includes at least one of a connection relationship and an inclusion relationship;
  • the dependency rule includes at least one of the following: the application type depends on the link type, and the link type depends on the device type;
  • locating the network node that causes the first network node to fail includes:
  • the second network node is written into the fault root source list, and the downstream network node having a direct dependency transfer relationship with the second network node is selected as the network node performing the next round of operations;
  • the downstream network node having a direct dependency transfer relationship with the second network node is selected as the network node performing the next round of operations;
  • the network nodes included in the fault root source list are located as the network nodes that cause the first network node to fail;
  • the most upstream network node in the dependency chain is the network node that depends on the other network nodes in the dependency chain.
  • determining, according to the working state of the second network node, the working state of the downstream network node having a direct dependency transfer relationship with it, and the working state of the upstream network node having a direct dependency transfer relationship with it, whether the second network node is a network node that causes the network node that issued the fault alarm to fail includes:
  • when the working state of the second network node is an abnormal state, and it is further determined either that the second network node has no downstream network node with a direct dependency transfer relationship, or that the suspicion value of its directly dependent downstream network node being the fault root cause of the first network node's failure is greater than the set first threshold, determining that the second network node is a network node that causes the first network node to fail.
  • locating, according to the working state of each of the other network nodes included in the dependency chain, the network node that causes the first network node to fail from among those other network nodes includes:
  • when the number of determined dependency chains is at least two, calculating separately, for each downstream network node having a direct dependency transfer relationship with the first network node, the suspicion value of that node being the fault root cause of the first network node's failure;
  • calculating the suspicion value for a downstream network node having a direct dependency transfer relationship with the first network node includes:
  • counting, for each fault alarm level, the second value, that is, the number of network nodes belonging to that fault alarm level;
  • calculating, according to the determined total value, first value, and second value, the suspicion value of the downstream network node having a direct dependency transfer relationship with the first network node being the fault root cause of the first network node's failure includes:
  • where $S_{1i}$ is the suspicion value of the $i$-th downstream network node having a direct dependency transfer relationship with the first network node being the fault root cause of the failure of the network node that issued the fault alarm; $i$ ranges from 1 to $N$, where $N$ is a natural number; $m_{1i}$ is the total number of upstream network nodes having a dependency transfer relationship with that downstream network node; $n_{1i}$ is the first value, that is, the number of those upstream network nodes whose working state is abnormal; and $w_{1i}$ is the second value, that is, the number of network nodes belonging to the same fault alarm level.
  • A device for locating a fault, including:
  • a receiving unit, configured to receive at least one fault alarm, where each of the fault alarms includes an identifier of the first network node that issues the fault alarm and the alarm type of the fault alarm issued by the first network node, and the alarm type includes at least one of an application type, a link type, and a device type;
  • a searching unit, configured to search, according to the identifier of the first network node and the alarm type that are included in the fault alarm, for a dependency transfer relationship that satisfies the dependency rule corresponding to the alarm type and includes the first network node;
  • a determining unit, configured to determine, according to the dependency transfer relationship, a dependency chain including the first network node and other network nodes having a dependency transfer relationship with the first network node, wherein the dependency chain characterizes the dependency transfer relationship from the first network node to each of the other network nodes, and the dependency transfer relationship includes at least one of a connection relationship and an inclusion relationship;
  • a positioning unit, configured to locate, according to the working state of each of the other network nodes included in the dependency chain, the network node that causes the first network node to fail from among those other network nodes.
  • the dependency rule includes at least one of the following: the application type depends on the link type, and the link type depends on the device type;
  • the searching unit is specifically configured to determine, according to the alarm type of the fault alarm issued by the first network node, as included in the fault alarm, a dependency rule that satisfies the alarm type;
  • the positioning unit is specifically configured to perform, starting from the most upstream network node in the dependency chain, the following operations in sequence until they have been performed for each of the other network nodes included in the dependency chain:
  • the second network node is written into the fault root source list, and the downstream network node having a direct dependency transfer relationship with the second network node is selected as the network node performing the next round of operations;
  • the downstream network node is selected as the network node performing the next round of operations;
  • the network nodes included in the fault root source list are located as the network nodes that cause the first network node to fail;
  • the most upstream network node in the dependency chain is the network node that depends on the other network nodes in the dependency chain.
  • the positioning unit is configured to: when the working state of the second network node is abnormal, and it is further determined either that the second network node has no downstream network node with a direct dependency transfer relationship, or that the suspicion value of its directly dependent downstream network node being the fault root cause of the first network node's failure is greater than a set threshold, determine that the second network node is a network node that causes the first network node to fail.
  • the positioning unit is specifically configured to: when the number of determined dependency chains is at least two, calculate separately, for each downstream network node having a direct dependency transfer relationship with the first network node, the suspicion value of that node being the fault root cause of the first network node's failure;
  • when calculating the suspicion value of a downstream network node having a direct dependency transfer relationship with the first network node being the fault root cause of the first network node's failure, the positioning unit is specifically configured to:
  • count, for each fault alarm level, the second value, that is, the number of network nodes belonging to that fault alarm level;
  • when calculating, according to the determined total value, first value, and second value, the suspicion value of the downstream network node having a direct dependency transfer relationship with the first network node being the fault root cause of the first network node's failure, the positioning unit is specifically configured to:
  • where $S_{1i}$ is the suspicion value of the $i$-th downstream network node having a direct dependency transfer relationship with the first network node being the fault root cause of the failure of the network node that issued the fault alarm; $i$ ranges from 1 to $N$, where $N$ is a natural number; $m_{1i}$ is the total number of upstream network nodes having a dependency transfer relationship with that downstream network node; $n_{1i}$ is the first value, that is, the number of those upstream network nodes whose working state is abnormal; and $w_{1i}$ is the second value, that is, the number of network nodes belonging to the same fault alarm level.
  • The embodiment of the present invention receives at least one fault alarm, where each of the fault alarms includes an identifier of the first network node that issues the fault alarm and the alarm type of the fault alarm issued by the first network node, and the alarm type includes at least one of an application type, a link type, and a device type; searches, according to the identifier of the first network node and the alarm type that are included in the fault alarm,
  • for a dependency transfer relationship that satisfies the dependency rule corresponding to the alarm type and includes the first network node; determines, according to the dependency transfer relationship, a dependency chain including the first network node and other network nodes having a dependency transfer relationship with the first network node, where the dependency chain characterizes the dependency transfer relationship from the first network node to each of the other network nodes, and the dependency transfer relationship includes at least one of a connection relationship and an inclusion relationship; and, according to each of the other
  • FIG. 1 is a schematic flowchart of a method for locating a fault according to an embodiment of the present invention.
  • FIG. 2 is a schematic flowchart of a method for locating a fault according to an embodiment of the present invention.
  • FIG. 3 is a network topology diagram of a fault alarm.
  • FIG. 4 is a network topology diagram of a fault alarm.
  • FIG. 5 is a network topology diagram of a fault alarm.
  • FIG. 6 is a network topology diagram of a fault alarm.
  • FIG. 7 is a schematic structural diagram of a device for locating a fault according to an embodiment of the present disclosure.
  • FIG. 8 is a schematic structural diagram of a device for locating a fault according to an embodiment of the present invention.
  • An embodiment of the present invention provides a method and a device for locating a fault. At least one fault alarm is received, where each of the fault alarms includes an identifier of the first network node that issues the fault alarm and the alarm type of the fault alarm issued by the first network node, and the alarm type includes at least one of an application type, a link type, and a device type.
  • According to the identifier of the first network node and the alarm type that are included in the fault alarm, a search is made for a dependency transfer relationship that satisfies the dependency rule corresponding to the alarm type and includes the first network node. According to the dependency transfer relationship, a dependency chain is determined that includes the first network node and other network nodes having a dependency transfer relationship with the first network node. The dependency chain characterizes the dependency transfer relationship from the first network node to each of the other network nodes.
  • The dependency transfer relationship includes at least one of a connection relationship and an inclusion relationship. According to the working state of each of the other network nodes included in the dependency chain, the network node that causes the first network node to fail is located from among those other network nodes. That is, in the embodiment of the present invention, regardless of whether a local or a global fault alarm occurs in the system,
  • at least one dependency chain that includes the network node issuing the fault alarm can be determined by using the dependency transfer relationships between different network nodes and the type of the alarm generated.
  • The dependency transfer relationship includes a direct dependency transfer relationship and an indirect dependency transfer relationship.
  • A direct dependency transfer relationship is a dependency that exists directly between the relying party and the party it depends on.
  • An indirect dependency transfer relationship involves a relying party, a first-level depended-on party, and a second-level depended-on party: the dependency between the relying party and the second-level depended-on party is called an indirect dependency transfer relationship.
  • The dependency transfer relationship includes at least one of a connection relationship and an inclusion relationship.
  • The object at the top of the dependency chain is called the most upstream dependent object, and the other objects in the dependency chain are the objects that the most upstream dependent object depends on.
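  • The distinction between direct and indirect dependency transfer relationships can be illustrated with a minimal Python sketch. The node names and the `DIRECT` table are hypothetical, and the function name `indirect_dependencies` is introduced here only for illustration; it is not part of the described method.

```python
# Hypothetical direct dependencies: each key depends directly on the listed nodes.
DIRECT = {
    "application 1": ["application 2"],
    "application 2": ["virtual machine 3"],
    "virtual machine 3": ["physical device 4"],
    "physical device 4": [],
}


def indirect_dependencies(node, direct):
    """Return the nodes reached from `node` through two or more direct hops."""
    seen = set()
    # Start from the second level: whatever the direct dependencies depend on.
    stack = [second
             for first in direct.get(node, [])
             for second in direct.get(first, [])]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(direct.get(current, []))
    return sorted(seen)


print(indirect_dependencies("application 1", DIRECT))
```

  • In this sketch, application 1 depends directly on application 2 and indirectly on virtual machine 3 and physical device 4.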
  • FIG. 1 is a schematic flowchart of a method for locating a fault according to an embodiment of the present invention. The method is as follows.
  • Step 101: Receive at least one fault alarm.
  • Each of the fault alarms includes an identifier of the first network node that issues the fault alarm and the alarm type of the fault alarm issued by the first network node, where the alarm type includes at least one of an application type, a link type, and a device type.
  • In step 101, the control device receives the fault alarms sent by different network nodes and determines, from each received fault alarm, the identifier of the first network node that issued it and its alarm type, for example a fault alarm of the application type, a fault alarm of the link type, or a fault alarm of the device type.
  • Step 102: According to the identifier of the first network node and the alarm type that are included in the fault alarm, search for a dependency transfer relationship that satisfies the dependency rule corresponding to the alarm type and includes the first network node.
  • In step 102, the dependency transfer relationships between different network nodes can be abstracted into a network topology diagram by the data center Topo model, so the network nodes in the network topology diagram have dependency transfer relationships with one another. These dependency transfer relationships may include the application-to-virtual-machine-to-physical-device mapping relationship, and may also include link relationships between different network layers, for example a Layer 2 link or a Layer 3 link.
  • For example, in the dependency transfer relationship among an application, a virtual machine, and a physical device, the application depends on the virtual machine it runs on, and the virtual machine depends on the physical device it runs on. That is, the application is the upper-layer dependent object of the virtual machine, and the virtual machine is the lower-layer dependent object of the application; the virtual machine is the upper-layer dependent object of the physical device, and the physical device is the lower-layer dependent object of the virtual machine.
  • The dependency transfer relationship described in the embodiment of the present invention includes a connection relationship. For example, if a data connection is established between network node 1 and network node 2, the dependency transfer relationship between network node 1 and network node 2 includes a connection relationship.
  • The dependency transfer relationship also includes an inclusion relationship. For example, an application running on a virtual machine can be described as the application including the virtual machine, so the dependency transfer relationship between the application and the virtual machine is an inclusion relationship.
  • When receiving the fault alarm, the control device searches, according to the identifier of the first network node and the alarm type that are included in the fault alarm, for a dependency transfer relationship that satisfies the dependency rule corresponding to the alarm type and includes the first network node.
  • The dependency rule includes at least one of the following: the application type depends on the link type, and the link type depends on the device type.
  • Specifically, a dependency rule that satisfies the alarm type is determined according to the alarm type of the fault alarm issued by the first network node, as included in the fault alarm.
  • If the alarm type is the application type, the dependency rule that satisfies the alarm type is determined to be: the application type depends on the link type, and the link type depends on the device type.
  • The application that issued the fault alarm is then found on the first network node and used as the starting point for determining a dependency transfer relationship that includes the application. For example, the application on the first network node depends on the virtual machine on the first network node, and the virtual machine on the first network node depends on the physical device of the first network node.
  • If the alarm type is the link type, the dependency rule that satisfies the alarm type is determined to be: the link type depends on the device type.
  • Link 1, which issued the fault alarm, is then found on the first network node and used as the starting point for determining a dependency transfer relationship that includes link 1.
  • For example, link 1 on the first network node depends on the first network
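  • The rule lookup described above can be sketched in Python as follows. The table `DEPENDS_ON` encodes only the two type-level dependencies stated in the text; the function name `dependency_rule` and the empty result for a device-type alarm are assumptions made for illustration.

```python
# Type-level dependencies stated in the text:
# the application type depends on the link type, the link type on the device type.
DEPENDS_ON = {"application": "link", "link": "device"}


def dependency_rule(alarm_type):
    """Return the type-level dependencies that apply, starting at alarm_type."""
    rule = []
    current = alarm_type
    while current in DEPENDS_ON:
        rule.append((current, DEPENDS_ON[current]))
        current = DEPENDS_ON[current]
    return rule


print(dependency_rule("application"))
print(dependency_rule("link"))
```

  • An application-type alarm yields both dependencies, while a link-type alarm yields only the link-to-device dependency, matching the two cases in the text.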
  • Step 103: Determine, according to the dependency transfer relationship, a dependency chain including the first network node and other network nodes having a dependency transfer relationship with the first network node.
  • The dependency chain characterizes the dependency transfer relationship from the first network node to each of the other network nodes, and the dependency transfer relationship includes at least one of a connection relationship and an inclusion relationship.
  • In step 103, according to the determined dependency transfer relationship, the first network node that issued the fault alarm, and the alarm type, a dependency chain of the first network node and the other network nodes having a dependency transfer relationship with it is established, starting from the first network node.
  • For example, suppose the dependency transfer relationships can be expressed as: application 1 depends on application 2; application 2 depends on virtual machine 3; virtual machine 3 depends on physical device 4. If application 1 issues a fault alarm, the dependency chain that includes application 1 and the nodes having a dependency transfer relationship with it (application 2, virtual machine 3, and physical device 4) is determined according to these relationships, namely: application 1 - application 2 - virtual machine 3 - physical device 4.
  • That is, the starting point of the dependency transfer relationship to be searched is determined first.
  • A dependency chain including the starting point and the other network nodes having a dependency transfer relationship with the starting point is then established from the determined starting point.
  • The dependency transfer relationship described herein includes at least one of a direct dependency transfer relationship and an indirect dependency transfer relationship.
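  • As a minimal sketch of chain construction, the following Python function enumerates every dependency chain that begins at a given starting point. The adjacency mapping and the name `build_chains` are hypothetical, and the guard against revisiting a node is an added assumption to keep the walk finite if the topology contains a cycle.

```python
def build_chains(start, depends_on):
    """Enumerate every dependency chain that begins at `start`.

    `depends_on` maps a node to the nodes it depends on directly.
    """
    chains = []

    def walk(path):
        # Skip nodes already on the path so a cyclic topology cannot loop forever.
        nxt = [n for n in depends_on.get(path[-1], []) if n not in path]
        if not nxt:
            chains.append(path)
            return
        for n in nxt:
            walk(path + [n])

    walk([start])
    return chains


# The example from the text: application 1 issues the fault alarm.
topology = {
    "application 1": ["application 2"],
    "application 2": ["virtual machine 3"],
    "virtual machine 3": ["physical device 4"],
}
print(build_chains("application 1", topology))
```

  • If the starting point depends directly on more than one node, the same function returns one chain per branch.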
  • The number of fault alarms received is not limited to one.
  • When multiple fault alarms are received, they can be classified according to the alarm type.
  • The dependency chain of the network node whose fault alarm is of the application type is established first. That is, if the received fault alarms include both an application-type fault alarm and a device-type fault alarm,
  • the dependency chain is first determined by using the network node that issued the application-type fault alarm as the starting point.
  • The network nodes that have a direct and/or indirect dependency transfer relationship with the network node that issued the fault alarm may or may not include other network nodes that issued fault alarms; whether such a network node is included depends on whether it has a dependency transfer relationship with the network node that issued the fault alarm.
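  • The ordering of starting points can be sketched in Python as follows. The `FaultAlarm` class, its field names, and the node identifiers are hypothetical. The text states only that application-type alarms come first; ranking link-type alarms ahead of device-type alarms is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultAlarm:
    node_id: str      # identifier of the network node that issued the alarm
    alarm_type: str   # "application", "link", or "device"


# Application first (from the text); the link/device order is an assumption.
PRIORITY = {"application": 0, "link": 1, "device": 2}


def order_starting_points(alarms):
    """Sort alarms so that application-type alarms are processed first."""
    return sorted(alarms, key=lambda alarm: PRIORITY[alarm.alarm_type])


alarms = [
    FaultAlarm("node-7", "device"),
    FaultAlarm("node-2", "application"),
    FaultAlarm("node-5", "link"),
]
print([alarm.node_id for alarm in order_starting_points(alarms)])
```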
  • Step 104: Locate, according to the working state of each of the other network nodes included in the dependency chain, the network node that causes the first network node to fail from among those other network nodes.
  • In step 104, for each dependency chain determined in step 103, the network node in the dependency chain that causes the first network node to fail is determined in the following manner,
  • and the network node so located is taken as the root cause of the fault that triggered the fault alarm of the first network node.
  • For each dependency chain, starting from the most upstream network node in the dependency chain, the following operations are performed in sequence until they have been performed for each of the other network nodes included in the dependency chain:
  • First step: determine the second network node performing the current round of operations.
  • Second step: determine, according to the working state of the second network node, the working state of the downstream network node having a direct dependency transfer relationship with it, and the working state of the upstream network node having a direct dependency transfer relationship with it, whether the second network node is a network node that causes the first network node to fail; if yes, perform the fourth step; if not, perform the third step.
  • The working state may be a normal running state or an abnormal running state.
  • The abnormal running state may also be referred to as a disabled state.
  • The disabled state may include a link failure, a device function failure, and a partial device function failure.
  • For example, a switch has 48 ports, 10 of which cannot be used. This indicates a partial failure of the switch's function; for the links on the 10 unusable ports, the working state is the disabled state.
  • When the determined working state of the second network node is an abnormal state, and it is further determined either that the second network node has no downstream network node with a direct dependency transfer relationship, or that the suspicion value of its directly dependent downstream network node being the fault source of the first network node's failure is greater than the set first threshold, the second network node is determined to be a network node that causes the first network node to fail.
  • When the determined working state of the second network node is an abnormal state, and it is further determined that the second network node has a downstream network node with a direct dependency transfer relationship, and/or that the suspicion value of its directly dependent downstream network node being the fault source of the first network node's failure is not greater than the set first threshold, the second network node is determined not to be a network node that causes the first network node to fail.
  • Third step: select the downstream network node having a direct dependency transfer relationship with the second network node as the network node performing the next round of operations.
  • Fourth step: write the second network node into the fault root source list.
  • The network nodes included in the fault root source list are located as the network nodes that cause the first network node to fail.
  • The most upstream network node in the dependency chain is the network node that depends on the other network nodes in the dependency chain.
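  • The round-by-round walk along one dependency chain can be sketched in Python as follows. This is an illustrative reading, not the definitive procedure: the translated text is ambiguous about the threshold condition, and the sketch assumes that a node is a root cause when it is abnormal and either has no direct downstream node or its direct downstream node has a suspicion value above the threshold. The function name, the inputs, and the sample values are hypothetical, and the suspicion values are taken as given rather than computed.

```python
def locate_root_causes(chain, abnormal, downstream_suspicion, threshold):
    """Walk `chain` from its most upstream node and collect root-cause nodes.

    chain: node names ordered from most upstream to most downstream.
    abnormal: set of nodes whose working state is abnormal.
    downstream_suspicion: suspicion value per node (assumed precomputed).
    threshold: the set first threshold.
    """
    fault_root_source_list = []
    for index, node in enumerate(chain):
        # The direct downstream node is the next entry in the chain, if any.
        downstream = chain[index + 1] if index + 1 < len(chain) else None
        if node in abnormal and (
            downstream is None
            or downstream_suspicion.get(downstream, 0.0) > threshold
        ):
            fault_root_source_list.append(node)
        # In either case the walk continues with the direct downstream node.
    return fault_root_source_list


chain = ["application 1", "application 2", "virtual machine 3", "physical device 4"]
abnormal = {"application 1", "virtual machine 3", "physical device 4"}
suspicion = {"application 2": 0.2, "virtual machine 3": 0.9, "physical device 4": 0.4}
print(locate_root_causes(chain, abnormal, suspicion, threshold=0.5))
```

  • With these sample values only physical device 4 is written into the fault root source list: it is abnormal and has no direct downstream node.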
  • If the number of determined dependency chains is at least two, the suspicion value of each downstream network node having a direct dependency transfer relationship with the first network node being the fault source of the first network node's failure is calculated separately.
  • For example, the first dependency chain can be expressed as: application 1 - application 2 - virtual machine 3 - physical device 4; the second dependency chain can be expressed as: application 1 - application 3 - virtual machine 5 - physical device 6.
  • Suspicion value 1, for application 2 being the root cause of the failure of application 1, and suspicion value 2, for application 3 being the root cause of the failure of application 1, are calculated separately.
  • From the dependency chain containing application 3 and application 1, the root cause of the fault that causes application 1 to issue a fault alarm is determined.
  • Calculating the suspicion value of a downstream network node having a direct dependency transfer relationship with the first network node being the root cause of the first network node's failure includes:
  • counting, for each fault alarm level, the second value, that is, the number of network nodes belonging to that fault alarm level;
  • The fault alarm level can be classified as high, medium, or low.
  • Different alarm levels can be selected according to the severity of the fault. In this way, for network nodes in an abnormal working state, the number of network nodes at each alarm level can be counted.
  • calculating the suspicion value of being the root cause of the failure of the network node includes:
  • where $S_{1i}$ is the suspicion value of the $i$-th downstream network node being the root cause of the first network node's failure; $i$ ranges from 1 to $N$, where $N$ is a natural number; $m_{1i}$ is the total number of upstream network nodes having a dependency transfer relationship with that downstream network node; $n_{1i}$ is the first value, that is, the number of those upstream network nodes whose working state is abnormal;
  • and $w_{1i}$ is the second value, that is, the number of network nodes belonging to the same fault alarm level.
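  • The three inputs to the suspicion value (the total value, the first value, and the second values per alarm level) can be gathered as in the following Python sketch. The tuple layout and the name `count_inputs` are hypothetical. The sketch stops at the counting stage and does not reproduce how the counts are combined into the suspicion value.

```python
from collections import Counter


def count_inputs(upstream_nodes):
    """Count the inputs used for one downstream node's suspicion value.

    upstream_nodes: list of (name, is_abnormal, alarm_level) tuples, where
    alarm_level is "high", "medium", "low", or None when no alarm was issued.
    """
    total_value = len(upstream_nodes)
    first_value = sum(1 for _, is_abnormal, _ in upstream_nodes if is_abnormal)
    second_values = Counter(
        level
        for _, is_abnormal, level in upstream_nodes
        if is_abnormal and level is not None
    )
    return total_value, first_value, dict(second_values)


upstream = [
    ("application 1", True, "high"),
    ("application 5", True, "low"),
    ("application 6", False, None),
    ("application 7", True, "high"),
]
print(count_inputs(upstream))
```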
  • The terms "first" and "second" in "first network node" and "second network node" in the embodiments of the present invention have no substantive meaning and are only used to distinguish two different network nodes.
  • With the solution of the embodiment of the present invention, at least one fault alarm is received, where the fault alarm includes an identifier of the first network node that issues the fault alarm and the alarm type of the fault alarm issued by the first network node, and the alarm type includes at least one of an application type, a link type, and a device type; and, according to the identifier of the first network node and the alarm type that are included in the fault alarm, a matching dependency transfer relationship is searched for.
  • FIG. 2 is a schematic flowchart of a method for locating a fault according to an embodiment of the present invention. Based on the inventive concept of locating the root cause of the fault shown in FIG. 1, several positioning fault rules are determined in advance according to the dependency relationships. When a fault alarm is received, the cause of the failure of the network node that issued the fault alarm is located according to the dependency relationships between each network node and the network node that issued the fault alarm and according to the determined positioning fault rules.
  • The method is as follows.
  • Positioning fault rule 1: When an upper-layer dependent object fails, the fault root source is located among the lower-layer dependent objects of the failed upper-layer dependent object.
  • Positioning fault rule 2: The fault root source is located among the lower-layer dependent objects of the failed upper-layer dependent object by calculating, for each lower-layer dependent object, the suspicion value of it having caused the fault.
  • Positioning fault rule 3: For a faulty object, if some or all of its upper-layer dependent objects are faulty and it has no faulty lower-layer dependent object, the faulty object is determined to be the root cause of the fault.
  • Step 201: When a fault alarm is received, obtain, according to the network topology of the system and fault-locating rule 1, the fault suspect list of objects that may have caused the fault alarm.
  • The fault suspect list includes the faulty objects that issued fault alarms and the faulty objects whose working state is abnormal.
  • The fault suspect list S_t = {A_1, A_2, A_3, ..., A_p, ..., A_q} is obtained, where t, p and q are natural numbers.
  • Step 202: Select a faulty object from the fault suspect list as the inference object for the current round.
  • Step 203: Calculate, according to fault-locating rule 2, the suspicion value of the faulty object A_p being the root cause of the fault.
  • Specifically, determining the suspicion value of the faulty object A_p being the root cause of the fault includes:
  • determining the total number of dependent objects that have an upper-layer dependency relationship with the faulty object A_p;
  • determining the first number, namely the number of those dependent objects whose working state is abnormal;
  • for the dependent objects having an upper-layer dependency relationship with A_p whose working state is abnormal, counting, according to the severity of the fault alarm issued by each of them, the second number, namely the number of dependent objects at each fault alarm severity level; and
  • calculating, from the total number, the first number and the second number, the suspicion value of the faulty object A_p being the root cause of the fault.
  • The suspicion value of the faulty object A_p being the root cause of the fault is calculated by:
  • where S_p is the calculated suspicion value of the faulty object A_p being the root cause of the fault; p ranges from 1 to q, where q is a natural number; m_p is the total number of faulty objects having an upper-layer dependency relationship with A_p; n_p is the first number, namely the number of those faulty objects whose working state is abnormal; and w_p is the second number, namely the number of dependent objects at each fault alarm severity level.
  • Step 204: Determine whether the faulty object A_p has a faulty lower-layer dependent object. If yes, go to step 206; if not, go to step 205.
  • Step 205: When it is determined that the selected faulty object A_p causes some or all of its upper-layer dependent objects to fail and has no faulty lower-layer dependent object, write A_p into the fault root cause suspect list according to fault-locating rule 3.
  • In step 205, the faulty object A_p is removed from the fault suspect list and written into the fault root cause suspect list; it is then determined whether the fault suspect list still contains a faulty object on which inference has not been performed. If so, the process returns to step 202; if not, step 208 is performed.
  • Step 206: When it is determined that the selected faulty object A_p causes some or all of its upper-layer dependent objects to fail and has a faulty lower-layer dependent object, remove the upper-layer dependent objects of A_p from the fault suspect list.
  • Step 207: Following the lower-layer dependency relationships, calculate in turn the suspicion value of each faulty lower-layer object on which A_p directly depends, and of each faulty lower-layer object on which A_p indirectly depends, being the root cause of the fault, until a faulty indirectly dependent lower-layer object has no faulty lower-layer dependent object of its own; write the faulty object with the largest calculated suspicion value into the fault root cause suspect list, and return to step 202.
  • Step 208: Locate, from the fault root cause suspect list, the root cause of the fault that caused the fault alarm.
  • In step 208, the faulty object with the largest suspicion value in the fault root cause suspect list is selected as the root cause of the fault that caused the fault alarm.
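Steps 201 to 208 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the suspicion formula is given in the source only as an image, so `abnormal_ratio` below is a hypothetical stand-in (the share of abnormal upper-layer dependents); step 207 is simplified to a greedy descent to the most suspicious faulty lower-layer object; the dependency graph is assumed to be acyclic; and all function and node names are invented for the example.

```python
from typing import Callable, Dict, List, Optional, Set

Graph = Dict[str, List[str]]


def invert(lower: Graph) -> Graph:
    """Derive the upper-layer view (who depends on me) from the lower-layer view."""
    upper: Graph = {}
    for node, deps in lower.items():
        for dep in deps:
            upper.setdefault(dep, []).append(node)
    return upper


def abnormal_ratio(obj: str, upper: Graph, abnormal: Set[str]) -> float:
    """Placeholder score (an assumption, NOT the formula of the patent):
    the share of upper-layer dependents of obj whose state is abnormal."""
    dependents = upper.get(obj, [])
    if not dependents:
        return 0.0
    return sum(d in abnormal for d in dependents) / len(dependents)


def locate_root_cause(
    suspects: List[str],
    lower: Graph,
    abnormal: Set[str],
    score: Callable[[str, Graph, Set[str]], float] = abnormal_ratio,
) -> Optional[str]:
    upper = invert(lower)
    pending = list(suspects)                 # fault suspect list (step 201)
    root_suspects: Dict[str, float] = {}     # fault root cause suspect list
    while pending:
        current = pending.pop(0)             # step 202: pick an inference object
        while True:
            # step 204: does the object have faulty lower-layer dependents?
            faulty_lower = [d for d in lower.get(current, []) if d in abnormal]
            if not faulty_lower:
                break                        # rule 3 applies to `current`
            # step 206: its upper-layer dependents are no longer suspects
            for dependent in upper.get(current, []):
                if dependent in pending:
                    pending.remove(dependent)
            # step 207 (simplified): descend to the most suspicious faulty dependent
            current = max(faulty_lower, key=lambda d: score(d, upper, abnormal))
            if current in pending:
                pending.remove(current)
        root_suspects[current] = score(current, upper, abnormal)   # step 205
    if not root_suspects:
        return None
    return max(root_suspects, key=root_suspects.get)               # step 208


# Hypothetical three-layer example: applications depend on a link, the link on devices.
lower = {"App1": ["LinkA"], "App2": ["LinkA"], "LinkA": ["Dev1", "Dev2"]}
abnormal = {"App1", "App2", "LinkA", "Dev2"}
root = locate_root_cause(["App1", "App2", "LinkA", "Dev2"], lower, abnormal)
```

With this data the walk descends from App1 through LinkA to Dev2, which has no faulty lower-layer dependent and is therefore reported as the root cause.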
  • FIG. 3 is a network topology diagram of a fault alarm.
  • Taking the multi-link fault shown in FIG. 3 as an example, this embodiment of the present invention illustrates how the root cause of a fault is located in a multi-link failure scenario. The method can be as follows.
  • SW1 port A is disabled and SW2 port A is disabled.
  • the pre-processing results are: Link13 causes SW1 and SW3 to be disabled, and Link22 causes SW2 and Host2 to be disabled.
  • the root cause of the failure that caused SW1 port A to be disabled and the root cause of the failure that caused SW2 port A to be disabled are determined in turn as described above.
  • Taking the disabling of SW1 port A as the starting point, dependency chain 11 is determined, which contains SW1 port A and the other network nodes having a dependency transfer relationship with SW1 port A; taking the disabling of SW2 port A as the starting point, dependency chain 12 is determined, which contains SW2 port A and the other network nodes having a dependency transfer relationship with SW2 port A.
  • Dependency chain 11, determined with the disabling of SW1 port A as the starting point, includes Link13, which connects SW1 port A to SW3, together with SW1 and SW3; SW1 port A has a direct dependency transfer relationship with Link13, and Link13 has a direct dependency transfer relationship with SW1 and with SW3.
  • Dependency chain 12, determined with the disabling of SW2 port A as the starting point, includes Link22, which connects SW2 port A to Host2, together with SW2 and Host2; SW2 port A has a direct dependency transfer relationship with Link22, and Link22 has a direct dependency transfer relationship with SW2 and with Host2.
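The construction of dependency chains 11 and 12 can be pictured as a traversal of the dependency transfer relationships, starting from the node that raised the alarm. The sketch below is an illustration only: the dictionary `DEPENDS_ON` encodes the FIG. 3 relationships as they are described in the text (Link13 with SW1 and SW3, Link22 with SW2 and Host2), and the function name and the breadth-first order are assumptions made for the example.

```python
from collections import deque
from typing import Dict, List

# Direct dependency transfer relationships of FIG. 3, as described in the text.
DEPENDS_ON: Dict[str, List[str]] = {
    "SW1 port A": ["Link13"],
    "Link13": ["SW1", "SW3"],
    "SW2 port A": ["Link22"],
    "Link22": ["SW2", "Host2"],
}


def dependency_chain(start: str, depends_on: Dict[str, List[str]]) -> List[str]:
    """Collect the start node and every node reachable from it through
    direct dependency transfer relationships (breadth-first)."""
    chain: List[str] = []
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        chain.append(node)
        for dep in depends_on.get(node, []):
            if dep not in seen:          # guard against revisiting shared nodes
                seen.add(dep)
                queue.append(dep)
    return chain


chain_11 = dependency_chain("SW1 port A", DEPENDS_ON)
chain_12 = dependency_chain("SW2 port A", DEPENDS_ON)
```

Each chain starts at the alarming port and lists the link before the devices at its two ends, which mirrors the direct and indirect relationships named above.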
  • Figure 4 is a network topology diagram of a fault alarm.
  • Taking the device fault without loss of connection shown in FIG. 4 as an example, this embodiment of the present invention illustrates how the root cause of a fault is located when a device fails but remains reachable. The method can be as follows.
  • SW1 port B issues a fault alarm and SW3 port A issues a fault alarm.
  • the pre-processing results are: Link12 causes SW1 and SW2 to be disabled; Link23 causes SW2 and SW3 to be disabled. According to the above manner, the root cause of the fault causing the SW1 port B to issue a fault alarm and the SW3 port A to issue a fault alarm are sequentially determined.
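  • Taking SW1 port B as the starting point, dependency chain 21 is determined, which contains SW1 port B and the other network nodes having a dependency transfer relationship with SW1 port B; taking SW3 port A as the starting point, dependency chain 22 is determined, which contains SW3 port A and the other network nodes having a dependency transfer relationship with SW3 port A.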
  • Dependency chain 21, determined with SW1 port B as the starting point, includes Link12, which connects SW1 port B to SW2, together with SW1 and SW2; SW1 port B has a direct dependency transfer relationship with Link12, and Link12 has a direct dependency transfer relationship with SW1 and with SW2.
  • Dependency chain 22, determined with SW3 port A as the starting point, includes Link23, which connects SW3 port A to SW2, together with SW2 and SW3; SW3 port A has a direct dependency transfer relationship with Link23, and Link23 has a direct dependency transfer relationship with SW2 and with SW3.
  • the fault suspect list that causes the fault alarm is determined.
  • the fault suspect list includes: Link12, Link23 and SW2.
  • In the third step, the suspicion value of Link12 having caused SW1 port B to issue a fault alarm and the suspicion value of Link23 having caused SW3 port A to issue a fault alarm are calculated.
  • In the fifth step, when the suspicion value of SW2 having caused the alarms is greater than the suspicion value of Link12 having caused the SW1 port B alarm and the suspicion value of Link23 having caused the SW3 port A alarm, Link12 and Link23 are excluded as suspected causes of the alarms.
  • Because SW2 causes some or all of its upper-layer dependent objects to be disabled and has no faulty lower-layer dependent object, SW2 is determined, according to fault-locating rule 3, to be the root cause of the SW1 port B alarm and the SW3 port A alarm.
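One informal way to see why SW2 stands out in the FIG. 4 example is that it is the only abnormal node that appears in both dependency chains. The sketch below only illustrates that observation; it does not compute the suspicion values of the patent, and the function name and the abnormal set (taken from the alarms and the suspect list in the text) are assumptions.

```python
from collections import Counter
from typing import Iterable, List, Set


def shared_abnormal_nodes(chains: Iterable[List[str]], abnormal: Set[str]) -> List[str]:
    """Return the abnormal nodes that occur in more than one dependency chain."""
    counts = Counter(
        node
        for chain in chains
        for node in set(chain)       # count each node once per chain
        if node in abnormal
    )
    return sorted(node for node, seen_in in counts.items() if seen_in > 1)


chain_21 = ["SW1 port B", "Link12", "SW1", "SW2"]
chain_22 = ["SW3 port A", "Link23", "SW2", "SW3"]
abnormal = {"SW1 port B", "SW3 port A", "Link12", "Link23", "SW2"}

shared = shared_abnormal_nodes([chain_21, chain_22], abnormal)
```

A node shared by several alarming chains explains more alarms at once, which is consistent with SW2 receiving a higher suspicion value than Link12 or Link23 in the text.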
  • Figure 5 is a network topology diagram of a fault alarm.
  • Taking the device fault with loss of connection shown in FIG. 5 as an example, this embodiment of the present invention illustrates how the root cause of a fault is located when a device fails and becomes unreachable. The method can be as follows.
  • SW1 port B issues a fault alarm and SW2 loses connection.
  • The pre-processing results are: Link2 causes SW1 and SW2 to be disabled; SW2 causes L3 to lose. In the manner described above, the root causes of SW1 port B issuing a fault alarm and of SW2 losing connection are determined in turn.
  • Taking SW1 port B as the starting point, dependency chain 31 is determined, which contains SW1 port B and the other network nodes having a dependency transfer relationship with SW1 port B; taking SW2 as the starting point, dependency chain 32 is determined, which contains SW2 and the other network nodes having a dependency transfer relationship with SW2.
  • Dependency chain 31, determined with SW1 port B as the starting point, includes Link01; Link12, which connects SW1 port B to SW2; IP02, which connects M1 to SW2 through SW1 port B; and SW1 and SW2. SW1 port B has a direct dependency transfer relationship with Link12 and with IP02, and Link12 has a direct dependency transfer relationship with SW1 and with SW2.
  • Dependency chain 32, determined with SW2 as the starting point, includes SW2.
  • the calculated suspicion degree of Link01 is 0, and the suspicion degree of Link12 is 100.
  • the IP02 is considered to be the source of the fault.
  • Link12 is used as the inference object.
  • The lower-layer dependent objects of Link12 are SW1 and SW2; the suspicion values of SW1 and of SW2 being the root cause of the fault are calculated respectively.
  • The calculated suspicion value of SW1 having caused the alarm is 0; the calculated suspicion value of SW2 having caused the alarm is [illegible].
  • Since SW2 is a lower-layer dependent object of Link12 and SW2 has lost connection, SW2 is used as the inference object.
  • Since the suspicion value of SW2 is greater than that of SW1, SW1 is excluded as the suspected root cause. Since SW2 has lost connection and has no faulty lower-layer dependent object, SW2 is determined, according to fault-locating rule 3, to be the root cause of the fault that caused the alarm.
  • Figure 6 is a network topology diagram of a fault alarm.
  • Taking the device fault with loss of connection shown in FIG. 6 as an example, this embodiment of the present invention illustrates how the root cause of a fault is located when devices fail and become unreachable. The method can be as follows.
  • SW1 port B issues a fault alarm, and SW2, host4, and host5 lose connection. In the manner described above, the root causes of SW1 port B issuing a fault alarm and of SW2, host4, and host5 losing connection are determined in turn.
  • IP02 includes Link01 and Link12.
  • Since host4 has lost connection, IP04 is the lower-layer object on which host4 directly depends.
  • IP04 includes Link01, Link12 and Link24.
  • Since host5 has lost connection, IP05 is the lower-layer object on which host5 directly depends.
  • IP05 includes Link01, Link12 and Link25.
  • In the second step, for IP02, the suspicion values of Link01 and Link12 having caused the fault alarm are calculated; for IP04, the suspicion values of Link01, Link12 and Link24; and for IP05, the suspicion values of Link01, Link12 and Link25.
  • The calculated suspicion value of Link01 is 2, that of Link12 is 102, that of Link24 is 0, and that of Link25 is 0.
  • Link12 is used as the inference object.
  • The lower-layer dependent objects of Link12 are SW1 and SW2; the suspicion values of SW1 and of SW2 having caused the fault alarm are calculated respectively.
  • The calculated suspicion value of SW1 having caused the fault alarm is 0; the calculated suspicion value of SW2 having caused the fault alarm is 0.
  • Link12 is therefore the root cause of the fault that causes SW1 port B to issue a fault alarm and SW2, host4, and host5 to lose connection.
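The final selection in the FIG. 6 example can be replayed with the suspicion values stated in the text. The sketch below takes those values as given (it does not derive them, since the formula appears in the source only as an image); the function name `pick_root` and the threshold of 0 are assumptions made for illustration.

```python
from typing import Dict

# Suspicion values as stated in the text for FIG. 6.
link_scores: Dict[str, int] = {"Link01": 2, "Link12": 102, "Link24": 0, "Link25": 0}
# Suspicion values of the lower-layer dependent objects of Link12.
lower_scores: Dict[str, Dict[str, int]] = {"Link12": {"SW1": 0, "SW2": 0}}


def pick_root(scores: Dict[str, int],
              lower: Dict[str, Dict[str, int]],
              threshold: int = 0) -> str:
    """Pick the most suspicious object; descend one layer only if a lower-layer
    dependent object has a suspicion value above the (assumed) threshold."""
    candidate = max(scores, key=scores.get)
    deeper = {n: s for n, s in lower.get(candidate, {}).items() if s > threshold}
    if deeper:
        return max(deeper, key=deeper.get)
    return candidate


root = pick_root(link_scores, lower_scores)
```

Because neither SW1 nor SW2 has a suspicion value above the threshold, the inference stops at Link12, matching the conclusion above.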
  • FIG. 7 is a schematic structural diagram of a device for locating a fault according to an embodiment of the present invention.
  • the illustrated device includes a receiving unit 71, a lookup unit 72, a determining unit 73, and a positioning unit 74, wherein:
  • the receiving unit 71 is configured to receive at least one fault alarm, where each of the fault alarms includes an identifier of a first network node that issues a fault alarm and an alarm type that the first network node issues a fault alarm, and the alarm type Include at least one of an application type, a link type, and a device type;
  • the searching unit 72 is configured to search, according to the identifier of the first network node that issued the fault alarm and the alarm type of the fault alarm issued by the first network node, both contained in the fault alarm, for a dependency transfer relationship that satisfies the dependency rule corresponding to that alarm type and that contains the first network node;
  • a determining unit 73 configured to determine, according to the dependency transfer relationship, a dependency chain including the first network node and other network nodes having a dependency transfer relationship with the first network node, where the dependency chain is used for characterization a dependency transfer relationship from the first network node to each of the other network nodes, the dependency transfer relationship including at least one of a connection relationship and an inclusion relationship;
  • the locating unit 74 is configured to locate, according to the working states of the other network nodes included in the dependency chain, the network node among those other network nodes that causes the first network node to fail.
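  • the dependency rule includes at least one of the following: the application type depends on the link type, and the link type depends on the device type;
  • the searching unit 72 is specifically configured to determine, according to the alarm type of the fault alarm issued by the first network node and contained in the fault alarm, a dependency rule that satisfies the alarm type; and to search, according to the dependency rule and the identifier of the first network node contained in the fault alarm, for a dependency transfer relationship containing the first network node.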
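The alarm contents and the type-level dependency rule described above can be modelled with a few small types. This is a sketch under assumptions: the class names, field names and the helper `rule_for` are hypothetical, and only the two rules named in the text (application depends on link, link depends on device) are encoded.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class AlarmType(Enum):
    APPLICATION = "application"
    LINK = "link"
    DEVICE = "device"


# Dependency rule from the text: application type -> link type -> device type.
TYPE_DEPENDS_ON: Dict[AlarmType, AlarmType] = {
    AlarmType.APPLICATION: AlarmType.LINK,
    AlarmType.LINK: AlarmType.DEVICE,
}


@dataclass(frozen=True)
class FaultAlarm:
    node_id: str            # identifier of the first network node
    alarm_type: AlarmType   # type of the alarm that node issued


def rule_for(alarm: FaultAlarm) -> List[AlarmType]:
    """Return the chain of types that the alarm type depends on,
    starting with the alarm type itself."""
    path = [alarm.alarm_type]
    while path[-1] in TYPE_DEPENDS_ON:
        path.append(TYPE_DEPENDS_ON[path[-1]])
    return path


alarm = FaultAlarm(node_id="app-1", alarm_type=AlarmType.APPLICATION)
layers = rule_for(alarm)
```

For an application-type alarm the rule spans all three layers, so the search for dependency transfer relationships would cover links and devices; a device-type alarm yields only the device layer.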
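  • the positioning unit 74 is specifically configured to perform the following operations in turn, starting from the most upstream network node in the dependency chain, until every one of the other network nodes included in the dependency chain has been processed:
  • determine the second network node on which the current round of operations is performed;
  • judge, according to the working state of the second network node, the working state of the downstream network node with which it has a direct dependency transfer relationship, and the working state of the upstream network node with which it has a direct dependency transfer relationship, whether the second network node is the network node that causes the first network node to fail;
  • if yes, write the second network node into the fault root cause list, and select the downstream network node having a direct dependency transfer relationship with the second network node as the network node for the next round of operations;
  • if no, select the downstream network node having a direct dependency transfer relationship with the second network node as the network node for the next round of operations;
  • when every one of the other network nodes included in the dependency chain has been processed, locate the network nodes included in the fault root cause list as the network nodes that cause the first network node to fail;
  • where the most upstream network node in the dependency chain is the network node that, according to the dependency transfer relationship, depends on the other network nodes in the dependency chain.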
  • the positioning unit 74 is specifically configured to determine that the second network node is the network node that causes the first network node to fail when the working state of the second network node is abnormal and it is further determined either that the second network node has no downstream network node with a direct dependency transfer relationship, or that the suspicion value of such a downstream network node being the root cause of the failure of the first network node is greater than a set first threshold.
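  • the positioning unit 74 is specifically configured to: if the number of determined dependency chains is at least two, calculate, for each downstream network node having a direct dependency transfer relationship with the first network node, the suspicion value of that node being the root cause of the failure of the first network node; select, from the calculated suspicion values, those greater than a set second threshold; determine the dependency chain containing the first network node and the downstream network node corresponding to each selected suspicion value; and locate, based on the determined dependency chain and the working states of the other network nodes it contains, the network node that causes the first network node to fail;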
  • when calculating the suspicion value of a downstream network node having a direct dependency transfer relationship with the first network node being the root cause of the failure of the first network node, the locating unit 74 is specifically configured to:
  • determine the total number of upstream network nodes having a dependency transfer relationship with the downstream network node, and determine the first number, namely the number of those upstream network nodes whose working state is abnormal; and, for the upstream network nodes whose working state is abnormal, count, according to the severity of the fault alarm issued by each of them, the second number, namely the number of network nodes belonging to the same fault alarm severity level;
  • when calculating that suspicion value from the determined total number, first number and second number, the positioning unit 74 is specifically configured to calculate it as follows:
  • where S1i is the suspicion value that the i-th downstream network node having a direct dependency transfer relationship with the first network node is the root cause of the failure of the network node that issued the fault alarm; i ranges from 1 to N, where N is a natural number; m1i is the total number of upstream network nodes having a dependency transfer relationship with the downstream network node; n1i is the first number, namely the number of those upstream network nodes whose working state is abnormal; and w1i is the second number, namely the number of network nodes belonging to the same fault alarm severity level.
  • the device provided by the embodiment of the present invention may be implemented in a software manner or in a hardware manner, which is not limited herein.
  • FIG. 8 is a schematic structural diagram of a device for locating a fault according to an embodiment of the present invention.
  • the device described in the embodiment of the present invention can be implemented by using a general computer structure.
  • The device includes a signal receiver 81 and a processor 82, where the signal receiver 81 and the processor 82 are connected to and communicate with each other via a bus 83.
  • the signal receiver 81 is configured to receive at least one fault alarm, where each fault alarm includes an identifier of the first network node that issued the fault alarm and the alarm type of the fault alarm issued by the first network node, the alarm type including at least one of an application type, a link type, and a device type;
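  • the processor 82 is configured to: search, according to the identifier of the first network node that issued the fault alarm and the alarm type of the fault alarm issued by the first network node, both contained in the fault alarm, for a dependency transfer relationship that satisfies the dependency rule corresponding to that alarm type and that contains the first network node;
  • determine, according to the dependency transfer relationship, a dependency chain comprising the first network node and the other network nodes having a dependency transfer relationship with the first network node, where the dependency chain characterizes the dependency transfer relationships from the first network node to each of the other network nodes, the dependency transfer relationship including at least one of a connection relationship and an inclusion relationship; and locate, according to the working states of the other network nodes included in the dependency chain, the network node among them that causes the first network node to fail.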
  • the dependency rule includes at least one of the following: the application type depends on the link type, and the link type depends on the device type;
  • the searching by the processor 82, according to the identifier of the first network node and the alarm type contained in the fault alarm, for a dependency transfer relationship that satisfies the dependency rule corresponding to the alarm type and contains the first network node includes: determining, according to the alarm type contained in the fault alarm, a dependency rule that satisfies the alarm type; and searching, according to the dependency rule and the identifier of the first network node, for a dependency transfer relationship containing the first network node.
  • the locating by the processor 82, according to the working states of the other network nodes included in the dependency chain, of the network node among them that causes the first network node to fail includes performing the following operations in turn, starting from the most upstream network node in the dependency chain, until every one of the other network nodes has been processed:
  • determining the second network node on which the current round of operations is performed, and judging whether the second network node is the network node that causes the first network node to fail;
  • if yes, writing the second network node into the fault root cause list, and selecting the downstream network node having a direct dependency transfer relationship with the second network node as the network node for the next round of operations;
  • if no, selecting the downstream network node having a direct dependency transfer relationship with the second network node as the network node for the next round of operations;
  • when every one of the other network nodes has been processed, locating the network nodes included in the fault root cause list as the network nodes that cause the first network node to fail;
  • where the most upstream network node in the dependency chain is the network node that, according to the dependency transfer relationship, depends on the other network nodes in the dependency chain.
  • the judging by the processor 82, according to the working state of the second network node, the working state of the downstream network node with which it has a direct dependency transfer relationship, and the working state of the upstream network node with which it has a direct dependency transfer relationship, of whether the second network node is the network node that causes the network node that issued the fault alarm to fail includes:
  • when the working state of the second network node is abnormal, and it is further determined either that the second network node has no downstream network node with a direct dependency transfer relationship, or that the suspicion value of such a downstream network node being the root cause of the failure of the first network node is greater than a set first threshold, determining that the second network node is the network node that causes the first network node to fail.
  • the locating by the processor 82, according to the working states of the other network nodes included in the dependency chain, of the network node among them that causes the first network node to fail includes:
  • if the number of determined dependency chains is at least two, calculating, for each downstream network node having a direct dependency transfer relationship with the first network node, the suspicion value of that node being the root cause of the failure of the first network node;
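  • the calculating by the processor 82 of the suspicion value of a downstream network node having a direct dependency transfer relationship with the first network node being the root cause of the failure of the first network node includes:
  • determining the total number of upstream network nodes having a dependency transfer relationship with the downstream network node, and the first number, namely the number of those upstream network nodes whose working state is abnormal; and, for the upstream network nodes whose working state is abnormal, counting, according to the severity of the fault alarm issued by each of them, the second number, namely the number of network nodes belonging to the same fault alarm severity level;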
  • the calculating by the processor 82, from the determined total number, first number and second number, of the suspicion value of the downstream network node having a direct dependency transfer relationship with the first network node being the root cause of the failure of the first network node includes:
  • where S1i is the suspicion value that the i-th downstream network node having a direct dependency transfer relationship with the first network node is the root cause of the failure of the network node that issued the fault alarm; i ranges from 1 to N, where N is a natural number; m1i is the total number of upstream network nodes having a dependency transfer relationship with the downstream network node; n1i is the first number, namely the number of those upstream network nodes whose working state is abnormal; and w1i is the second number, namely the number of network nodes belonging to the same fault alarm severity level.
  • Regardless of whether a local fault alarm or a global fault alarm occurs in the system, the fault-locating device can determine, from the dependency transfer relationships between the different network nodes and the type of the alarm, at least one dependency chain containing the network node that issued the fault alarm. Based on that dependency chain and the working states of the other network nodes it contains, the device locates, among those other network nodes, the network node that caused the fault. The root cause is thus located flexibly from the dependency transfer relationships between network nodes, which avoids the low fault-locating efficiency caused by relying on preset fault rules and improves the efficiency of locating the root cause of a fault.
  • Embodiments of the present invention can be provided as a method, an apparatus (device), or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical memory, and the like) containing computer-usable program code.
  • The computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising an instruction apparatus, which implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing device, such that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing; the instructions executed on the computer or other programmable device thus provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

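The present invention discloses a method and device for locating a fault, including: receiving at least one fault alarm; searching, according to the identifier of the first network node that issued the fault alarm and the alarm type of the fault alarm issued by the first network node, both contained in the fault alarm, for a dependency transfer relationship that satisfies the dependency rule corresponding to that alarm type and that contains the first network node; determining, according to the dependency transfer relationship, a dependency chain containing the first network node and the other network nodes having a dependency transfer relationship with the first network node; and locating, according to the working states of the other network nodes included in the dependency chain, the network node among them that causes the first network node to fail. This effectively avoids the low fault-locating efficiency caused by relying on preset fault rules and improves the efficiency of locating the root cause of a fault.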

Description

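Method and Device for Locating a Fault
Technical Field
The present invention relates to the field of computer technologies, and in particular to a method and device for locating a fault.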
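Background
In data center maintenance, when a device fails and loses its function, not only does the failed device issue an alarm; devices associated with the failed device also issue alarms. Such a large number of alarms reduces the efficiency with which maintenance personnel determine the root cause of the fault and adds to the time needed to repair it.
At present, to locate the root cause of a fault quickly, a scheme that correlates alarms based on fault rules has been proposed. The scheme presets fault rules and feeds all alarms that occur into a fault rule engine, which uses the preset fault rules to locate the root cause of the fault.
For example, when device 1, device 2 and device 3 fail at the same time, the root cause is located as the failure of device 1 according to the preset fault rules (for example: a failure of device 1 causes device 2 to fail, and a failure of device 2 causes device 3 to fail).
Taking the same case again, when device 1, device 2 and device 3 fail at the same time and the preset fault rules contain "a failure of device 1 causes device 2 to fail" but not "a failure of device 2 causes device 3 to fail", the current approach can only determine that device 1 is the root cause of the failure of device 2; it cannot determine whether device 1 is also the root cause of the failure of device 3.
It can be seen that the current way of determining the root cause relies on preset fault rules and is relatively rigid, so fault locating is inefficient.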
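Summary
In view of this, embodiments of the present invention provide a method and device for locating a fault, to solve the current problem of low fault-locating efficiency.
According to a first aspect, a method for locating a fault is provided, including:
receiving at least one fault alarm, where each fault alarm contains an identifier of the first network node that issued the fault alarm and the alarm type of the fault alarm issued by the first network node, the alarm type including at least one of an application type, a link type, and a device type;
searching, according to the identifier of the first network node and the alarm type contained in the fault alarm, for a dependency transfer relationship that satisfies the dependency rule corresponding to the alarm type and that contains the first network node;
determining, according to the dependency transfer relationship, a dependency chain containing the first network node and the other network nodes having a dependency transfer relationship with the first network node, where the dependency chain characterizes the dependency transfer relationships from the first network node to each of the other network nodes, the dependency transfer relationship including at least one of a connection relationship and an inclusion relationship; and
locating, according to the working states of the other network nodes included in the dependency chain, the network node among them that causes the first network node to fail.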
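With reference to the possible implementations of the first aspect, in a first possible implementation of the first aspect, the dependency rule includes at least one of the following: the application type depends on the link type, and the link type depends on the device type;
and searching, according to the identifier of the first network node and the alarm type contained in the fault alarm, for a dependency transfer relationship that satisfies the dependency rule corresponding to the alarm type and that contains the first network node includes:
determining, according to the alarm type of the fault alarm issued by the first network node and contained in the fault alarm, a dependency rule that satisfies the alarm type; and
searching, according to the dependency rule and the identifier of the first network node contained in the fault alarm, for a dependency transfer relationship containing the first network node.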
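With reference to the possible implementations of the first aspect or the first possible implementation of the first aspect, in a second possible implementation of the first aspect, locating, according to the working states of the other network nodes included in the dependency chain, the network node among them that causes the first network node to fail includes:
performing the following operations in turn, starting from the most upstream network node in the dependency chain, until every one of the other network nodes included in the dependency chain has been processed:
determining the second network node on which the current round of operations is performed;
judging, according to the working state of the second network node, the working state of the downstream network node with which it has a direct dependency transfer relationship, and the working state of the upstream network node with which it has a direct dependency transfer relationship, whether the second network node is the network node that causes the first network node to fail;
if yes, writing the second network node into the fault root cause list, and selecting the downstream network node having a direct dependency transfer relationship with the second network node as the network node for the next round of operations;
if no, selecting the downstream network node having a direct dependency transfer relationship with the second network node as the network node for the next round of operations;
when every one of the other network nodes included in the dependency chain has been processed, locating the network nodes included in the fault root cause list as the network nodes that cause the first network node to fail;
where the most upstream network node in the dependency chain is the network node that, according to the dependency transfer relationship, depends on the other network nodes in the dependency chain.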
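With reference to the second possible implementation of the first aspect, in a third possible implementation of the first aspect, judging, according to the working state of the second network node, the working state of the downstream network node with which it has a direct dependency transfer relationship, and the working state of the upstream network node with which it has a direct dependency transfer relationship, whether the second network node is the network node that causes the network node that issued the fault alarm to fail includes:
when the working state of the second network node is abnormal, and it is further determined either that the second network node has no downstream network node with a direct dependency transfer relationship, or that the suspicion value of such a downstream network node being the root cause of the failure of the first network node is greater than a set first threshold, determining that the second network node is the network node that causes the first network node to fail.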
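With reference to the possible implementations of the first aspect, or the first, second, or third possible implementation of the first aspect, in a fourth possible implementation of the first aspect, locating, according to the working states of the other network nodes included in the dependency chain, the network node among them that causes the first network node to fail includes:
if the number of determined dependency chains is at least two, calculating, for each downstream network node having a direct dependency transfer relationship with the first network node, the suspicion value of that node being the root cause of the failure of the first network node;
selecting, from the calculated suspicion values, those greater than a set second threshold;
determining, according to each selected suspicion value, the dependency chain containing the first network node and the downstream network node corresponding to that suspicion value; and
locating, based on the determined dependency chain and according to the working states of the other network nodes it contains, the network node among them that causes the first network node to fail.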
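The chain selection of the fourth implementation can be sketched as a simple filter. The representation below is an assumption made for illustration: each chain is a list whose first element is the first network node and whose second element is the downstream node that directly depends-transfers from it; the suspicion values are supplied as a dictionary rather than computed, and the names are hypothetical.

```python
from typing import Dict, List


def select_chains(chains: List[List[str]],
                  scores: Dict[str, float],
                  second_threshold: float) -> List[List[str]]:
    """Keep only the chains whose direct downstream node (chain[1]) has a
    suspicion value above the second threshold."""
    return [
        chain
        for chain in chains
        if len(chain) > 1 and scores.get(chain[1], 0.0) > second_threshold
    ]


# Hypothetical data: node N1 raised the alarm and has two dependency chains.
chains = [["N1", "A", "X"], ["N1", "B", "Y"]]
scores = {"A": 100.0, "B": 0.0}
kept = select_chains(chains, scores, second_threshold=50.0)
```

Only the chain through node A survives the filter, so the subsequent per-node walk is restricted to that chain.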
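With reference to the third or fourth possible implementation of the first aspect, in a fifth possible implementation of the first aspect, calculating the suspicion value of a downstream network node having a direct dependency transfer relationship with the first network node being the root cause of the failure of the first network node includes:
determining the total number of upstream network nodes having a dependency transfer relationship with the downstream network node, and determining the first number, namely the number of those upstream network nodes whose working state is abnormal;
for the upstream network nodes whose working state is abnormal, counting, according to the severity of the fault alarm issued by each of them, the second number, namely the number of network nodes belonging to the same fault alarm severity level; and
calculating, from the determined total number, first number and second number, the suspicion value of the downstream network node having a direct dependency transfer relationship with the first network node being the root cause of the failure of the first network node.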
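With reference to the fifth possible implementation of the first aspect, in a sixth possible implementation of the first aspect, calculating that suspicion value from the determined total number, first number and second number includes:
calculating the suspicion value of the downstream network node having a direct dependency transfer relationship with the first network node being the root cause of the failure of the first network node as follows:
Figure PCTCN2016080096-appb-000001
where S1i is the calculated suspicion value that the i-th downstream network node having a direct dependency transfer relationship with the first network node is the root cause of the failure of the network node that issued the fault alarm; i ranges from 1 to N, where N is a natural number; m1i is the total number of upstream network nodes having a dependency transfer relationship with the downstream network node; n1i is the first number, namely the number of those upstream network nodes whose working state is abnormal; and w1i is the second number, namely the number of network nodes belonging to the same fault alarm severity level.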
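According to a second aspect, a device for locating a fault is provided, including:
a receiving unit, configured to receive at least one fault alarm, where each fault alarm contains an identifier of the first network node that issued the fault alarm and the alarm type of the fault alarm issued by the first network node, the alarm type including at least one of an application type, a link type, and a device type;
a searching unit, configured to search, according to the identifier of the first network node and the alarm type contained in the fault alarm, for a dependency transfer relationship that satisfies the dependency rule corresponding to the alarm type and that contains the first network node;
a determining unit, configured to determine, according to the dependency transfer relationship, a dependency chain containing the first network node and the other network nodes having a dependency transfer relationship with the first network node, where the dependency chain characterizes the dependency transfer relationships from the first network node to each of the other network nodes, the dependency transfer relationship including at least one of a connection relationship and an inclusion relationship; and
a locating unit, configured to locate, according to the working states of the other network nodes included in the dependency chain, the network node among them that causes the first network node to fail.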
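With reference to the possible implementations of the second aspect, in a first possible implementation of the second aspect, the dependency rule includes at least one of the following: the application type depends on the link type, and the link type depends on the device type;
the searching unit is specifically configured to determine, according to the alarm type of the fault alarm issued by the first network node and contained in the fault alarm, a dependency rule that satisfies the alarm type; and
to search, according to the dependency rule and the identifier of the first network node contained in the fault alarm, for a dependency transfer relationship containing the first network node.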
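With reference to the possible implementations of the second aspect or the first possible implementation of the second aspect, in a second possible implementation of the second aspect, the locating unit is specifically configured to perform the following operations in turn, starting from the most upstream network node in the dependency chain, until every one of the other network nodes included in the dependency chain has been processed:
determine the second network node on which the current round of operations is performed;
judge, according to the working state of the second network node, the working state of the downstream network node with which it has a direct dependency transfer relationship, and the working state of the upstream network node with which it has a direct dependency transfer relationship, whether the second network node is the network node that causes the first network node to fail;
if yes, write the second network node into the fault root cause list, and select the downstream network node having a direct dependency transfer relationship with the second network node as the network node for the next round of operations;
if no, select the downstream network node having a direct dependency transfer relationship with the second network node as the network node for the next round of operations;
when every one of the other network nodes included in the dependency chain has been processed, locate the network nodes included in the fault root cause list as the network nodes that cause the first network node to fail;
where the most upstream network node in the dependency chain is the network node that, according to the dependency transfer relationship, depends on the other network nodes in the dependency chain.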
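With reference to the second possible implementation of the second aspect, in a third possible implementation of the second aspect, the locating unit is specifically configured to determine that the second network node is the network node that causes the first network node to fail when the working state of the second network node is abnormal and it is further determined either that the second network node has no downstream network node with a direct dependency transfer relationship, or that the suspicion value of such a downstream network node being the root cause of the failure of the first network node is greater than a set first threshold.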
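With reference to the possible implementations of the second aspect, or the first, second, or third possible implementation of the second aspect, in a fourth possible implementation of the second aspect, the locating unit is specifically configured to: if the number of determined dependency chains is at least two, calculate, for each downstream network node having a direct dependency transfer relationship with the first network node, the suspicion value of that node being the root cause of the failure of the first network node;
select, from the calculated suspicion values, those greater than a set second threshold;
determine, according to each selected suspicion value, the dependency chain containing the first network node and the downstream network node corresponding to that suspicion value; and
locate, based on the determined dependency chain and according to the working states of the other network nodes it contains, the network node among them that causes the first network node to fail.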
结合第二方面的第三种可能的实施方式,或者结合第二方面的第四种可能 的实施方式,在第二方面的第五种可能的实施方式中,所述定位单元计算与所述第一网络节点具备直接依赖传递关系的下游网络节点导致所述第一网络节点发生故障的故障根源的嫌疑程度值,具体用于:
确定与所述下游网络节点具备依赖传递关系的上游网络节点的总个数值;以及确定所述上游网络节点中工作状态处于非正常状态的网络节点的第一个数值;
对于所述上游网络节点中工作状态处于非正常状态的网络节点,根据各个所述处于非正常状态的网络节点发出故障告警的程度值,分别统计得到属于同一种故障告警程度级别的网络节点的第二个数值;
根据确定的所述总个数值、所述第一个数值和所述第二个数值,计算得到与所述第一网络节点具备直接依赖传递关系的下游网络节点导致所述第一网络节点发生故障的故障根源的嫌疑程度值。
结合第二方面的第五种可能的实施方式,在第二方面的第六种可能的实施方式中,所述定位单元根据确定的所述总个数值、所述第一个数值和所述第二个数值,计算得到与所述第一网络节点具备直接依赖传递关系的下游网络节点导致所述第一网络节点发生故障的故障根源的嫌疑程度值,具体用于:
通过以下方式计算得到与所述第一网络节点具备直接依赖传递关系的下游网络节点导致所述第一网络节点发生故障的故障根源的嫌疑程度值:
Figure PCTCN2016080096-appb-000002
其中,S1i为计算得到与所述第一网络节点具备直接依赖传递关系的该第1i下游网络节点导致发出所述故障告警的网络节点发生故障的故障根源的嫌疑程度值,i的取值范围为1至N,N为自然数,m1i为与所述下游网络节点具备依赖传递关系的上游网络节点的总个数值,n1i为所述上游网络节点中工作状态处于非正常状态的网络节点的第一个数值,w1i为属于同一种故障告警程度级别的网 络节点的第二个数值。
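The beneficial effects of the present invention are as follows:
In the embodiments of the present invention, at least one fault alarm is received, where each fault alarm includes an identifier of a first network node that sends the fault alarm and the alarm type of that fault alarm, and the alarm type includes at least one of an application type, a link type, and a device type. According to the identifier and the alarm type included in the fault alarm, a dependency transfer relationship is searched for that satisfies the dependency rule corresponding to the alarm type and that includes the first network node. According to the dependency transfer relationship, a dependency chain is determined that includes the first network node and the other network nodes having a dependency transfer relationship with it; the dependency chain represents the dependency transfer relationships from the first network node to each of the other network nodes, and the dependency transfer relationship includes at least one of a connection relationship and a containment relationship. According to the working state of each of the other network nodes in the dependency chain, the network node that causes the first network node to fail is located from among those other network nodes. In other words, regardless of whether a local fault alarm or a global fault alarm occurs in the system, at least one dependency chain that includes the network node sending the fault alarm can be determined from the dependency transfer relationships between the network nodes and the type of the alarm; based on that dependency chain and the working states of the other network nodes in it, the network node that caused the fault is located. The fault root cause is thus located flexibly according to the dependency transfer relationships between network nodes, which avoids the low fault-locating efficiency caused by relying on preset fault rules and improves the efficiency of locating the fault root cause.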
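Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings required for describing the embodiments are briefly introduced below. Apparently, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of a method for locating a fault according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of a method for locating a fault according to an embodiment of the present invention;
FIG. 3 is a network topology diagram in which a fault alarm occurs;
FIG. 4 is a network topology diagram in which a fault alarm occurs;
FIG. 5 is a network topology diagram in which a fault alarm occurs;
FIG. 6 is a network topology diagram in which a fault alarm occurs;
FIG. 7 is a schematic structural diagram of a device for locating a fault according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a device for locating a fault according to an embodiment of the present invention.
Detailed Description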
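To achieve the objective of the present invention, the embodiments of the present invention provide a method and a device for locating a fault. At least one fault alarm is received, where each fault alarm includes an identifier of a first network node that sends the fault alarm and the alarm type of that fault alarm, and the alarm type includes at least one of an application type, a link type, and a device type. According to the identifier and the alarm type included in the fault alarm, a dependency transfer relationship is searched for that satisfies the dependency rule corresponding to the alarm type and that includes the first network node. According to the dependency transfer relationship, a dependency chain is determined that includes the first network node and the other network nodes having a dependency transfer relationship with it; the dependency chain represents the dependency transfer relationships from the first network node to each of the other network nodes, and the dependency transfer relationship includes at least one of a connection relationship and a containment relationship. According to the working state of each of the other network nodes in the dependency chain, the network node that causes the first network node to fail is located from among those other network nodes. In other words, regardless of whether a local fault alarm or a global fault alarm occurs in the system, at least one dependency chain that includes the network node sending the fault alarm can be determined from the dependency transfer relationships between the network nodes and the type of the alarm; based on that dependency chain and the working states of the other network nodes in it, the network node that caused the fault is located. The fault root cause is thus located flexibly according to the dependency transfer relationships between network nodes, which avoids the low fault-locating efficiency caused by relying on preset fault rules and improves the efficiency of locating the fault root cause.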
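It should be noted that the dependency transfer relationship includes a direct dependency transfer relationship and an indirect dependency transfer relationship.
A direct dependency transfer relationship is the relationship between a dependent and the object it directly depends on. An indirect dependency transfer relationship arises where there is a dependent, a first-level depended-on object, and a second-level depended-on object: the dependency between the dependent and the second-level depended-on object is called an indirect dependency transfer relationship.
It should also be noted that the dependency transfer relationship includes at least one of a connection relationship and a containment relationship.
It should further be noted that, for a dependency chain, the object at the top of the chain is called the most upstream dependency object, and all other objects in the chain are objects on which the most upstream dependency object depends.
The embodiments of the present invention are described in further detail below with reference to the accompanying drawings. Apparently, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.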
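FIG. 1 is a schematic flowchart of a method for locating a fault according to an embodiment of the present invention. The method may be as follows.
Step 101: Receive at least one fault alarm.
Each fault alarm includes an identifier of a first network node that sends the fault alarm and the alarm type of the fault alarm sent by the first network node, and the alarm type includes at least one of an application type, a link type, and a device type.
In step 101, a control device receives the fault alarms sent by different network nodes and, from each received fault alarm, determines the identifier of the first network node that sent it and its alarm type, for example, whether it is an application-type, link-type, or device-type fault alarm.
Step 102: Search, according to the identifier of the first network node and the alarm type that are included in the fault alarm, for a dependency transfer relationship that satisfies the dependency rule corresponding to the alarm type and that includes the first network node.
In step 102, the dependency transfer relationships between different network nodes can be abstracted by a data center Topo model and represented as a network topology graph, in which dependency transfer relationships exist between the network nodes. These relationships may include the mapping relationships among the network and the application, virtual machine, and physical device layers, and may also include the link relationships between different network layers, for example, layer-2 links and layer-3 links.
For example, consider the dependency transfer relationship among an application, a virtual machine, and a physical device: the application depends on the virtual machine to run, and the virtual machine depends on the physical device to run. That is, the application is the upper-layer dependency object of the virtual machine, and the virtual machine is the lower-layer dependency object of the application; the virtual machine is the upper-layer dependency object of the physical device, and the physical device is the lower-layer dependency object of the virtual machine.
It should be noted that the dependency transfer relationship described in the embodiments of the present invention includes a connection relationship. For example, if a data connection is established between network node 1 and network node 2, the dependency transfer relationship between network node 1 and network node 2 is a connection relationship. It also includes a containment relationship. For example, an application runs on a virtual machine, so the application can be regarded as contained in the virtual machine, and the dependency transfer relationship between the application and the virtual machine is a containment relationship.
On receiving a fault alarm, the control device searches, according to the identifier of the first network node and the alarm type that are included in the fault alarm, for a dependency transfer relationship that satisfies the dependency rule corresponding to the alarm type and that includes the first network node.
The dependency rule includes at least one of the following: the application type depends on the link type, and the link type depends on the device type.
Specifically, the dependency rule that satisfies the alarm type is determined according to the alarm type included in the fault alarm;
the dependency transfer relationship that includes the first network node is then searched for according to the dependency rule and the identifier of the first network node included in the fault alarm.
For example, if the identifier of the first network node in the received fault alarm is 11 and the alarm type is the application type, the dependency rule that satisfies the alarm type is determined to be: the application type depends on the link type, and the link type depends on the device type.
That is, the application that sent the fault alarm is found on the first network node and, with that application as the starting point, the dependency transfer relationships that include it are determined. For example, the application on the first network node depends on the virtual machine on the first network node, and that virtual machine depends on the physical device of the first network node.
As another example, if the identifier of the first network node in the received fault alarm is 11 and the alarm type is the link type, the dependency rule that satisfies the alarm type is determined to be: the link type depends on the device type.
That is, link 1, which sent the fault alarm, is found on the first network node and, with link 1 as the starting point, the dependency transfer relationships that include link 1 are determined. For example, link 1 on the first network node depends on link 2 between the first network node and another network node, and link 1 depends on that other network node.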
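The rule lookup in step 102 can be sketched as a small table keyed by alarm type. This is only an illustrative sketch: the names `DEPENDENCY_RULES` and `rules_for_alarm`, the dictionary layout of an alarm, and the English type labels are assumptions made here and do not appear in the original text.

```python
# Hypothetical rule table: each alarm type maps to the ordered list of
# types it depends on (application -> link -> device, as in the text).
DEPENDENCY_RULES = {
    "application": ["link", "device"],
    "link": ["device"],
    "device": [],  # assumption: a device alarm has no lower type to follow
}


def rules_for_alarm(alarm):
    """Return the node identifier and the (upper, lower) type pairs
    that make up the dependency rule for the alarm's type."""
    type_order = [alarm["type"]] + DEPENDENCY_RULES[alarm["type"]]
    pairs = list(zip(type_order, type_order[1:]))
    return alarm["node_id"], pairs


app_rule = rules_for_alarm({"node_id": 11, "type": "application"})
link_rule = rules_for_alarm({"node_id": 11, "type": "link"})
```

For the two examples above (node identifier 11 with an application-type and a link-type alarm), the sketch yields the rule pairs "application depends on link, link depends on device" and "link depends on device", respectively.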
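Step 103: Determine, according to the dependency transfer relationship, a dependency chain that includes the first network node and the other network nodes having a dependency transfer relationship with the first network node.
The dependency chain represents the dependency transfer relationships from the first network node to each of the other network nodes, and the dependency transfer relationship includes at least one of a connection relationship and a containment relationship.
In step 103, according to the determined dependency transfer relationship, the first network node that sends the fault alarm and the alarm type are determined and, with the first network node as the starting point, a dependency chain is established that includes the first network node and the other network nodes having a dependency transfer relationship with it.
For example, assume that there are application 1, application 2, virtual machine 3, and physical device 4, and that the dependency transfer relationships among them are: application 1 depends on application 2; application 2 depends on virtual machine 3; virtual machine 3 depends on physical device 4. If application 1 sends a fault alarm, then, according to these dependency transfer relationships, the dependency chain that includes application 1 and the objects with which it has a dependency transfer relationship (application 2, virtual machine 3, and physical device 4) is determined, namely application 1 - application 2 - virtual machine 3 - physical device 4.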
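The chain construction in this example can be sketched as a walk over direct dependencies. The sketch assumes, for simplicity, that each object has at most one direct lower-layer dependency; `build_chain` and the identifiers `app1`, `app2`, `vm3`, and `phys4` are hypothetical names introduced here for illustration.

```python
def build_chain(start, depends_on):
    """Follow direct dependencies from `start` until no further
    dependency is configured, returning the visited objects in order."""
    chain = [start]
    seen = {start}
    node = start
    while node in depends_on:
        node = depends_on[node]
        if node in seen:
            # Guard against a cyclic configuration (an assumption; the
            # original text does not discuss cycles).
            break
        chain.append(node)
        seen.add(node)
    return chain


# Direct dependency transfer relationships from the example above.
depends_on = {"app1": "app2", "app2": "vm3", "vm3": "phys4"}
chain = build_chain("app1", depends_on)
```

Here `chain` holds application 1, application 2, virtual machine 3, and physical device 4 in that order, matching the chain in the example. An object with several direct dependencies would yield several chains, which is the case handled later with the second threshold.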
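Specifically, when a fault alarm is received, the alarm type of the received fault alarm is determined first.
Next, the starting point of the dependency transfer relationships to be searched for is determined according to the alarm type.
Finally, according to the configured dependency transfer relationships between different network nodes, a dependency chain is established from the determined starting point, including the starting point and the other network nodes that have a dependency transfer relationship with it.
The dependency transfer relationship here includes at least one of a direct dependency transfer relationship and an indirect dependency transfer relationship.
It should be noted that more than one fault alarm may be received. When multiple fault alarms are received at the same time, they may be grouped by alarm type. When dependency chains are determined, priority is given to establishing the dependency chain for the network node whose fault alarm is of the application type. That is, if the received fault alarms include both application-type and device-type fault alarms, the dependency chain is first determined with the network node that sent the application-type fault alarm as the starting point.
It should also be noted that the other network nodes that have a direct and/or indirect dependency transfer relationship with the network node sending the fault alarm may or may not include other network nodes that also send fault alarms; whether they do depends on their dependency transfer relationships with the network node sending the fault alarm.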
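The prioritization of simultaneous alarms can be sketched as a stable sort by alarm type. The text only states that application-type alarms are handled first; placing link-type alarms before device-type alarms, as well as the names `TYPE_PRIORITY` and `order_alarms`, are assumptions made for this sketch.

```python
# Assumed priority: application first (from the text); the relative
# order of link and device is an assumption of this sketch.
TYPE_PRIORITY = {"application": 0, "link": 1, "device": 2}


def order_alarms(alarms):
    """Return the alarms ordered so that chain building starts with
    application-type alarms; ties keep their arrival order."""
    return sorted(alarms, key=lambda alarm: TYPE_PRIORITY[alarm["type"]])


alarms = [
    {"node_id": 4, "type": "device"},
    {"node_id": 1, "type": "application"},
    {"node_id": 2, "type": "link"},
]
ordered = order_alarms(alarms)
```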
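Step 104: Locate, according to the working state of each of the other network nodes included in the dependency chain, the network node that causes the first network node to fail from among those other network nodes.
In step 104, for each dependency chain determined in step 103, the network node in that chain that causes the first network node to fail is determined in the following manner, and the located network node is taken as the fault root cause that led to the fault alarm.
Specifically, for each dependency chain, the following operations are performed in sequence, starting from the most upstream network node in the dependency chain, until all of the other network nodes included in the dependency chain have been processed:
Step one: determine a second network node on which the current round of operations is performed.
Step two: determine, according to the working state of the second network node, the working state of its downstream network node with which it has a direct dependency transfer relationship, and the working state of its upstream network node with which it has a direct dependency transfer relationship, whether the second network node is the network node that causes the first network node to fail; if yes, go to step four; if no, go to step three.
It should be noted that the working state may be a normal operation state or an abnormal operation state; the abnormal operation state is also called the disabled state.
The disabled state may include a link being down, a device function failing, or a device function partially failing.
For example, a switch has 48 ports, 10 of which cannot be used. This indicates that the switch's function has partially failed, and the working state of the links connected to those 10 unusable ports is the disabled state.
Specifically, when the working state of the second network node is determined to be an abnormal state, and it is further determined that the second network node has no downstream network node with a direct dependency transfer relationship, or that the suspicion degree value of its directly dependent downstream network node being the fault root cause of the first network node's failure is greater than a set first threshold, the second network node is determined to be the network node that causes the first network node to fail.
In addition, when the working state of the second network node is determined to be an abnormal state, and it is further determined that the second network node has a downstream network node with a direct dependency transfer relationship and/or that the suspicion degree value of that downstream network node being the fault root cause of the first network node's failure is not greater than the set first threshold, the second network node is determined not to be the network node that causes the first network node to fail.
Step three: if the determination result is no, select the downstream network node that has a direct dependency transfer relationship with the second network node as the network node for the next round of operations.
Step four: if the determination result is yes, write the second network node into the fault root cause list.
At this point, if the upstream network node of the second network node has already been written into the fault root cause list, then, once the second network node is determined to be the network node that causes the first network node to fail, that upstream network node is removed from the fault root cause list.
It is then determined whether the second network node is the last network node of the dependency chain. If yes, the operations end; if no, the downstream network node that has a direct dependency transfer relationship with the second network node is selected as the network node for the next round of operations, and the operations continue.
When all of the other network nodes included in the dependency chain have been processed, the network nodes included in the fault root cause list are located as the network nodes that cause the first network node to fail.
It should be noted that the most upstream network node in the dependency chain is the network node that, according to the dependency transfer relationships, depends on the other network nodes in the dependency chain.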
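The traversal in step 104 can be sketched as a single pass over the chain from upstream to downstream. This is a hedged reading of the text rather than a definitive implementation: `locate_root_causes` is a hypothetical name, the chain passed in is assumed to contain only the "other" network nodes (without the alarming first network node), and the suspicion degree values are supplied as inputs because the formula itself is not reproduced in the text.

```python
def locate_root_causes(chain, abnormal, suspicion, first_threshold):
    """Walk `chain` (ordered upstream to downstream) and return the
    fault root cause list described in steps one to four."""
    roots = []
    for idx, node in enumerate(chain):
        downstream = chain[idx + 1] if idx + 1 < len(chain) else None
        # Step two: an abnormal node is a root cause if it has no direct
        # downstream node, or if the downstream node's suspicion degree
        # value exceeds the first threshold.
        is_root = node in abnormal and (
            downstream is None
            or suspicion.get(downstream, 0) > first_threshold
        )
        if is_root:
            # Step four: drop the direct upstream node from the list if
            # it was written earlier, then record the current node.
            upstream = chain[idx - 1] if idx > 0 else None
            if upstream in roots:
                roots.remove(upstream)
            roots.append(node)
    return roots


roots = locate_root_causes(
    chain=["app2", "vm3", "phys4"],
    abnormal={"vm3", "phys4"},
    suspicion={"phys4": 80},
    first_threshold=50,
)
```

In this made-up scenario, `vm3` is first written to the list because its downstream node `phys4` has a suspicion degree value above the threshold; `phys4` is then found to be abnormal with no downstream node, so it replaces `vm3`, and the final list contains only `phys4`.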
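Optionally, if at least two dependency chains are determined, the suspicion degree value of each downstream network node that has a direct dependency transfer relationship with the first network node being the fault root cause of the first network node's failure is calculated separately;
a suspicion degree value greater than a set second threshold is selected from the calculated suspicion degree values;
according to the selected suspicion degree value, the dependency chain that includes the first network node and the downstream network node corresponding to the selected suspicion degree value is determined; and
based on the determined dependency chain, the network node that causes the first network node to fail is located, according to the working state of each of the other network nodes included in the dependency chain, from among those other network nodes.
For example, there are two dependency chains starting from application 1. The first can be expressed as application 1 - application 2 - virtual machine 3 - physical device 4, and the second as application 1 - application 3 - virtual machine 5 - physical device 6. In this case, suspicion degree value 1, for application 2 being the fault root cause of application 1's failure, and suspicion degree value 2, for application 3 being the fault root cause of application 1's failure, are calculated separately.
A suspicion degree value greater than the set second threshold is then selected from suspicion degree value 1 and suspicion degree value 2. If suspicion degree value 2 is greater than the set second threshold, the dependency chain that includes application 3 (which corresponds to suspicion degree value 2) and application 1 is determined, and the fault root cause that led application 1 to send the fault alarm is determined for that dependency chain.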
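The chain selection in this example can be sketched as a filter on the suspicion degree value of the first downstream node of each chain. The function name `select_chains`, the node identifiers, and the numeric values 10, 70, and 50 are all made up for illustration; the text does not give concrete values.

```python
def select_chains(chains, suspicion, second_threshold):
    """Keep the chains whose node directly below the alarming node has a
    suspicion degree value greater than the second threshold."""
    return [
        chain
        for chain in chains
        if len(chain) > 1 and suspicion.get(chain[1], 0) > second_threshold
    ]


chains = [
    ["app1", "app2", "vm3", "phys4"],
    ["app1", "app3", "vm5", "phys6"],
]
# Hypothetical suspicion degree values 1 and 2 for app2 and app3.
suspicion = {"app2": 10, "app3": 70}
selected = select_chains(chains, suspicion, second_threshold=50)
```

With these assumed values only the second chain (through application 3) is kept, and the root cause search then proceeds on that chain alone.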
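Optionally, calculating the suspicion degree value of the downstream network node that has a direct dependency transfer relationship with the first network node being the fault root cause of the first network node's failure includes:
determining the total number of upstream network nodes that have a dependency transfer relationship with the downstream network node, and determining a first number, namely the number of those upstream network nodes whose working state is abnormal;
for the upstream network nodes whose working state is abnormal, counting, according to the severity of the fault alarm sent by each of them, a second number, namely the number of network nodes belonging to the same fault alarm severity level; and
calculating the suspicion degree value according to the determined total number, first number, and second number.
It should be noted that fault alarm severity may be divided into three levels: high, medium, and low. When a network node fails, it can raise an alarm at the level that matches the severity of the fault. In this way, for network nodes in an abnormal working state, the number of network nodes at each alarm severity level can be counted.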
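The three inputs to the suspicion degree value can be gathered as sketched below. The sketch deliberately stops before the final calculation, because the formula is only available as an image in the source and is not reproduced in the text. The function name `suspicion_inputs`, the severity labels, and the convention that nodes in a normal state are simply absent from `alarm_level` are assumptions of this sketch.

```python
from collections import Counter


def suspicion_inputs(upstream_nodes, alarm_level):
    """Return (total number, first number, second numbers per level).

    `alarm_level` maps each abnormal upstream node to the severity of
    the fault alarm it sent ("high", "medium", or "low").
    """
    total = len(upstream_nodes)
    abnormal = [node for node in upstream_nodes if node in alarm_level]
    first_number = len(abnormal)
    # One count per severity level: the "second number" for that level.
    second_numbers = dict(Counter(alarm_level[node] for node in abnormal))
    return total, first_number, second_numbers


inputs = suspicion_inputs(
    upstream_nodes=["a", "b", "c", "d"],
    alarm_level={"a": "high", "c": "high", "d": "low"},
)
```

For the four hypothetical upstream nodes above, three are abnormal, two of them at the high level and one at the low level.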
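Optionally, calculating, according to the determined total number, first number, and second number, the suspicion degree value of the downstream network node that has a direct dependency transfer relationship with the first network node being the fault root cause of the first network node's failure includes:
calculating that suspicion degree value in the following manner:
[Formula image: Figure PCTCN2016080096-appb-000003]
where S1i is the calculated suspicion degree value of the 1i-th downstream network node, which has a direct dependency transfer relationship with the first network node, being the fault root cause of the first network node's failure; i ranges from 1 to N, where N is a natural number; m1i is the total number of upstream network nodes that have a dependency transfer relationship with the downstream network node; n1i is the first number, that is, the number of those upstream network nodes whose working state is abnormal; and w1i is the second number, that is, the number of network nodes belonging to the same fault alarm severity level.
It should be noted that the "1" in "1i" refers to the network node that sends the fault alarm, and "i" refers to the i-th network node that has a direct dependency relationship with the network node that sends the fault alarm; i ranges from 1 to N, where N is a natural number.
It should be noted that "first" and "second" in "first network node" and "second network node" in the embodiments of the present invention have no substantive meaning and are used only to indicate two different network nodes.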
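With the solution of the embodiments of the present invention, at least one fault alarm is received, where each fault alarm includes an identifier of a first network node that sends the fault alarm and the alarm type of that fault alarm, and the alarm type includes at least one of an application type, a link type, and a device type. According to the identifier and the alarm type included in the fault alarm, a dependency transfer relationship is searched for that satisfies the dependency rule corresponding to the alarm type and that includes the first network node. According to the dependency transfer relationship, a dependency chain is determined that includes the first network node and the other network nodes having a dependency transfer relationship with it; the dependency chain represents the dependency transfer relationships from the first network node to each of the other network nodes, and the dependency transfer relationship includes at least one of a connection relationship and a containment relationship. According to the working state of each of the other network nodes in the dependency chain, the network node that causes the first network node to fail is located from among those other network nodes. In other words, regardless of whether a local fault alarm or a global fault alarm occurs in the system, at least one dependency chain that includes the network node sending the fault alarm can be determined from the dependency transfer relationships between the network nodes and the type of the alarm; based on that dependency chain and the working states of the other network nodes in it, the network node that caused the fault is located. The fault root cause is thus located flexibly according to the dependency transfer relationships between network nodes, which avoids the low fault-locating efficiency caused by relying on preset fault rules and improves the efficiency of locating the fault root cause.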
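FIG. 2 is a schematic flowchart of a method for locating a fault according to an embodiment of the present invention. Building on the inventive concept of locating the fault root cause shown in FIG. 1, several fault-locating rules are determined in advance according to the dependency relationships. When a fault alarm is received, the fault root cause that causes the network node sending the fault alarm to fail is located according to the dependency relationships between each network node and that network node and according to the determined fault-locating rules. The method may be as follows.
Fault-locating rule one: when an upper-layer dependency object fails, the fault root cause is located among the lower-layer dependency objects of the failed upper-layer dependency object.
Fault-locating rule two: the fault root cause is located among the lower-layer dependency objects of the failed upper-layer dependency object by calculating the suspicion degree value of each failed lower-layer dependency object being the fault root cause.
Fault-locating rule three: for a failed object, if the failed object causes some or all of its upper-layer dependency objects to fail and the failed object has no failed lower-layer dependency object, the failed object is determined to be the fault root cause.
Step 201: When a fault alarm is received, obtain, according to the network topology graph of the system and fault-locating rule one, a fault suspect list for the fault alarm.
The fault suspect list includes the fault objects that send fault alarms and the fault objects whose working state is the abnormal operation state.
For example, the fault suspect list St = {A1, A2, A3, ..., Ap, ..., Aq} is obtained, where t, p, and q are natural numbers.
Step 202: Select one fault object from the fault suspect list as the inference object for the current round.
In step 202, assume that the fault object Ap is selected from St = {A1, A2, A3, ..., Ap, ..., Aq}.
Step 203: Calculate, according to fault-locating rule two, the suspicion degree value of the fault object Ap being the fault root cause.
In step 203, calculating the suspicion degree value of the fault object Ap being the fault root cause specifically includes the following.
First, determine the total number of dependency objects that have an upper-layer dependency relationship with the fault object Ap.
Second, determine a first number, namely the number of those dependency objects whose working state is abnormal.
Third, for the dependency objects that have an upper-layer dependency relationship with the fault object Ap and whose working state is abnormal, obtain, according to the severity of the fault alarm sent by each of them, a second number, namely the number of dependency objects at each fault alarm severity level.
Finally, calculate, according to the determined total number, first number, and second number, the suspicion degree value of the fault object Ap being the fault root cause.
Specifically, the suspicion degree value of the fault object Ap being the fault root cause is calculated in the following manner:
[Formula image: Figure PCTCN2016080096-appb-000004]
where Sp is the calculated suspicion degree value of the fault object Ap being the fault root cause; p ranges from 1 to q, where q is a natural number; mp is the total number of objects that have an upper-layer dependency relationship with the fault object Ap; np is the first number, that is, the number of those objects whose working state is abnormal; and wp is the second number, that is, the number of objects at each fault alarm severity level.
Step 204: Determine whether the fault object Ap has a failed lower-layer dependency object; if yes, perform step 206; if no, perform step 205.
Step 205: When it is determined that the selected fault object Ap causes some or all of its upper-layer dependency objects to fail and has no failed lower-layer dependency object, write the fault object Ap into the root-cause suspect list according to fault-locating rule three.
In step 205, the fault object Ap is removed from the fault suspect list and written into the root-cause suspect list. It is then determined whether the fault suspect list still contains fault objects that have not been inferred. If yes, go to step 202; if no, perform step 208.
Step 206: When it is determined that the selected fault object Ap causes some or all of its upper-layer dependency objects to fail and has a failed lower-layer dependency object, remove the upper-layer dependency objects of the fault object Ap from the fault suspect list.
Step 207: Following the dependency relationships, separately calculate the suspicion degree value of each failed lower-layer direct dependency object of the fault object Ap being the fault root cause and the suspicion degree value of each failed lower-layer indirect dependency object of the fault object Ap being the fault root cause, until the failed lower-layer indirect dependency objects of the fault object Ap have no further failed lower-layer dependency objects; then write the fault object with the largest calculated suspicion degree value into the root-cause suspect list and go to step 202.
Step 208: Locate, from the root-cause suspect list, the fault root cause that led to the fault alarm.
In step 208, the fault object with the largest suspicion degree value in the root-cause suspect list is selected as the fault root cause that led to the fault alarm.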
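FIG. 3 is a network topology diagram in which a fault alarm occurs. Taking the multi-link fault shown in FIG. 3 as an example, this embodiment of the present invention provides a method for locating the fault root cause, to illustrate how the fault root cause is located in the case of a multi-link fault. The method may be as follows.
As can be seen from FIG. 3, port A of SW1 is disabled and port A of SW2 is disabled. The preprocessing result is: Link13 causes SW1 and SW3 to be disabled, and Link22 causes SW2 and Host2 to be disabled. The fault root cause of the disabling of SW1 port A and the fault root cause of the disabling of SW2 port A are determined in turn in the manner described above.
Step one: with the disabled SW1 port A as the starting point, determine dependency chain 11, which includes SW1 port A and the other network nodes that have a dependency transfer relationship with SW1 port A; and with the disabled SW2 port A as the starting point, determine dependency chain 12, which includes SW2 port A and the other network nodes that have a dependency transfer relationship with SW2 port A.
Specifically, dependency chain 11, determined with the disabled SW1 port A as the starting point, includes Link13 (which connects to SW3 through SW1 port A), SW1, and SW3, where SW1 port A has a direct dependency transfer relationship with Link13, and Link13 has a direct dependency transfer relationship with SW1 and SW3.
Dependency chain 12, determined with the disabled SW2 port A as the starting point, includes Link22 (which connects to Host2 through SW2 port A), SW2, and Host2, where SW2 port A has a direct dependency transfer relationship with Link22, and Link22 has a direct dependency transfer relationship with SW2 and Host2.
Step two: because Link13 causes SW1 port A to be disabled while SW3 runs normally, Link13 is determined, according to fault-locating rule three, to be the fault root cause of the disabling of SW1 port A.
Because Link22 causes SW2 port A to be disabled while Host2 runs normally, Link22 is determined, according to fault-locating rule three, to be the fault root cause of the disabling of SW2 port A.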
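Fault-locating rule three, as applied in this example, can be sketched as a predicate over a failed object and its upper-layer and lower-layer dependency objects. The function name `is_root_by_rule_three`, the identifiers such as `SW1.portA`, and the assumed states (both links and both ports failed, all switches and hosts normal) are illustrative assumptions; the actual states are shown only in FIG. 3, which is not reproduced here.

```python
def is_root_by_rule_three(obj, failed, upper, lower):
    """Rule three: `obj` has failed, at least one of its upper-layer
    dependency objects has failed, and none of its lower-layer
    dependency objects has failed."""
    return (
        obj in failed
        and any(u in failed for u in upper.get(obj, []))
        and not any(d in failed for d in lower.get(obj, []))
    )


# Assumed states for the multi-link example.
failed = {"SW1.portA", "Link13", "SW2.portA", "Link22"}
upper = {"Link13": ["SW1.portA"], "Link22": ["SW2.portA"]}
lower = {"Link13": ["SW1", "SW3"], "Link22": ["SW2", "Host2"]}

link13_is_root = is_root_by_rule_three("Link13", failed, upper, lower)
link22_is_root = is_root_by_rule_three("Link22", failed, upper, lower)
```

Under these assumed states both links satisfy rule three, which agrees with the conclusion of the example. If a lower-layer object such as SW3 were also marked as failed, Link13 would no longer qualify and the inference would move down to that object.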
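FIG. 4 is a network topology diagram in which a fault alarm occurs. Taking the device fault (without loss of contact) shown in FIG. 4 as an example, this embodiment of the present invention provides a method for locating the fault root cause, to illustrate how the fault root cause is located in the case of a device fault without loss of contact. The method may be as follows.
As can be seen from FIG. 4, SW1 port B sends a fault alarm and SW3 port A sends a fault alarm. The preprocessing result is: Link12 causes SW1 and SW2 to be disabled, and Link23 causes SW2 and SW3 to be disabled. The fault root cause of the fault alarm sent by SW1 port B and the fault root cause of the fault alarm sent by SW3 port A are determined in turn in the manner described above.
Step one: with SW1 port B as the starting point, determine dependency chain 21, which includes SW1 port B and the other network nodes that have a dependency transfer relationship with SW1 port B; and with SW3 port A as the starting point, determine dependency chain 22, which includes SW3 port A and the other network nodes that have a dependency transfer relationship with SW3 port A.
Specifically, dependency chain 21, determined with SW1 port B as the starting point, includes Link12 (which connects to SW2 through SW1 port B), SW1, and SW2, where SW1 port B has a direct dependency transfer relationship with Link12, and Link12 has a direct dependency transfer relationship with SW1 and SW2.
Dependency chain 22, determined with SW3 port A as the starting point, includes Link23 (which connects to SW2 through SW3 port A), SW2, and SW3, where SW3 port A has a direct dependency transfer relationship with Link23, and Link23 has a direct dependency transfer relationship with SW2 and SW3.
Step two: determine, according to the received fault alarms, the fault suspect list for the fault alarms.
The fault suspect list includes Link12, Link23, and SW2.
Step three: separately calculate the suspicion degree value of Link12 causing SW1 port B to send the fault alarm and the suspicion degree value of Link23 causing SW3 port A to send the fault alarm.
Step four: when it is determined that the failed dependency object of Link12 and Link23 is SW2, calculate the suspicion degree value of SW2 causing the alarms.
Step five: when it is determined that the suspicion degree value of SW2 is greater than both the suspicion degree value of Link12 causing SW1 port B to send the fault alarm and the suspicion degree value of Link23 causing SW3 port A to send the fault alarm, rule out Link12 and Link23 as causes of the alarms.
Step six: because SW2 causes some or all of its upper-layer dependency objects to be disabled and has no lower-layer dependency object, SW2 is determined, according to fault-locating rule three, to be the fault root cause of the SW1 port B alarm and the SW3 port A alarm.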
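FIG. 5 is a network topology diagram in which a fault alarm occurs. Taking the device fault (with loss of contact) shown in FIG. 5 as an example, this embodiment of the present invention provides a method for locating the fault root cause, to illustrate how the fault root cause is located in the case of a device fault with loss of contact. The method may be as follows.
As can be seen from FIG. 5, SW1 port B sends a fault alarm and SW2 has lost contact. The preprocessing result is: Link12 causes SW1 and SW2 to be disabled, and SW2 causes L3 to lose contact. The fault root cause of the fault alarm sent by SW1 port B and the fault root cause of the loss of contact with SW2 are determined in turn in the manner described above.
Step one: with SW1 port B as the starting point, determine dependency chain 31, which includes SW1 port B and the other network nodes that have a dependency transfer relationship with SW1 port B; and with SW2 as the starting point, determine dependency chain 32, which includes SW2 and the other network nodes that have a dependency transfer relationship with SW2.
Specifically, dependency chain 31, determined with SW1 port B as the starting point, includes Link01, Link12 (which connects to SW2 through SW1 port B), IP02 (the connection between M0 and SW2 established through SW1 port B), SW1, and SW2, where SW1 port B has a direct dependency transfer relationship with Link12 and with IP02, and Link12 has a direct dependency transfer relationship with SW1 and SW2.
Dependency chain 32, determined with SW2 as the starting point, includes SW2.
Step two: separately calculate the suspicion degree values of Link01 and Link12 causing the fault alarm.
The calculated suspicion degree value of Link01 is 0, and the calculated suspicion degree value of Link12 is 100.
Step three: because the suspicion degree value of Link12 is greater than that of Link01, rule out IP02 as the fault root cause.
Step four: with Link12 as the inference object, the lower-layer dependency objects of Link12 include SW1 and SW2; separately calculate the suspicion degree values of SW1 and SW2 causing the alarm.
The calculated suspicion degree value of SW1 causing the alarm is 0, and the calculated suspicion degree value of SW2 causing the alarm is φ.
Step five: because SW2 is a lower-layer dependency object of Link12 and SW2 has lost contact, take SW2 as the inference object.
Step six: because the suspicion degree value of SW2 is greater than that of SW1, rule out SW1 as the fault root cause. Because SW2 has lost contact and has no lower-layer dependency object, SW2 is determined, according to fault-locating rule three, to be the fault root cause of the alarm.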
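FIG. 6 is a network topology diagram in which a fault alarm occurs. Taking the device fault (with loss of contact) shown in FIG. 6 as an example, this embodiment of the present invention provides a method for locating the fault root cause, to illustrate how the fault root cause is located in the case of a device fault with loss of contact. The method may be as follows.
As can be seen from FIG. 6, SW1 port B sends a fault alarm, and SW2, host4, and host5 have lost contact. The fault root cause of the fault alarm sent by SW1 port B and of the loss of contact with SW2, host4, and host5 is determined in turn in the manner described above.
Step one: according to the dependency relationships, Link12 exists between SW1 and SW2, so Link12 is the lower-layer direct dependency object of SW1.
Because SW2 has lost contact, IP02 is the lower-layer direct dependency object of SW2.
IP02 in turn includes Link01 and Link12.
Because host4 has lost contact, IP04 is the lower-layer direct dependency object of host4.
IP04 in turn includes Link01, Link12, and Link24.
Because host5 has lost contact, IP05 is the lower-layer direct dependency object of host5.
IP05 in turn includes Link01, Link12, and Link25.
Step two: for IP02, separately calculate the suspicion degree values of Link01 and Link12 causing the fault alarm; for IP04, separately calculate the suspicion degree values of Link01, Link12, and Link24 causing the fault alarm; for IP05, separately calculate the suspicion degree values of Link01, Link12, and Link25 causing the fault alarm.
The calculated suspicion degree value of Link01 is 2, that of Link12 is 102, that of Link24 is 0, and that of Link25 is 0.
Step three: because the suspicion degree value of Link12 is the largest, rule out IP02, IP04, and IP05 as the fault root cause.
Step four: with Link12 as the inference object, the lower-layer dependency objects of Link12 are SW1 and SW2; separately calculate the suspicion degree values of SW1 and SW2 causing the fault alarm.
The calculated suspicion degree value of SW1 causing the fault alarm is 0, and the calculated suspicion degree value of SW2 causing the fault alarm is 0.
Step five: according to the calculation results, Link12 is determined to be the fault root cause of the fault alarm sent by SW1 port B and of the loss of contact with SW2, host4, and host5.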
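FIG. 7 is a schematic structural diagram of a device for locating a fault according to an embodiment of the present invention. The device includes a receiving unit 71, a searching unit 72, a determining unit 73, and a locating unit 74, where:
the receiving unit 71 is configured to receive at least one fault alarm, where each fault alarm includes an identifier of a first network node that sends the fault alarm and the alarm type of the fault alarm sent by the first network node, and the alarm type includes at least one of an application type, a link type, and a device type;
the searching unit 72 is configured to search, according to the identifier of the first network node and the alarm type that are included in the fault alarm, for a dependency transfer relationship that satisfies the dependency rule corresponding to the alarm type and that includes the first network node;
the determining unit 73 is configured to determine, according to the dependency transfer relationship, a dependency chain that includes the first network node and the other network nodes having a dependency transfer relationship with the first network node, where the dependency chain represents the dependency transfer relationships from the first network node to each of the other network nodes, and the dependency transfer relationship includes at least one of a connection relationship and a containment relationship; and
the locating unit 74 is configured to locate, according to the working state of each of the other network nodes included in the dependency chain, the network node that causes the first network node to fail from among those other network nodes.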
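Optionally, the dependency rule includes at least one of the following: the application type depends on the link type, and the link type depends on the device type;
the searching unit 72 is specifically configured to determine, according to the alarm type included in the fault alarm, the dependency rule that satisfies the alarm type; and
search, according to the dependency rule and the identifier of the first network node included in the fault alarm, for the dependency transfer relationship that includes the first network node.
Specifically, the locating unit 74 is configured to perform the following operations in sequence, starting from the most upstream network node in the dependency chain, until all of the other network nodes included in the dependency chain have been processed:
determine a second network node on which the current round of operations is performed;
determine, according to the working state of the second network node, the working state of its downstream network node with which it has a direct dependency transfer relationship, and the working state of its upstream network node with which it has a direct dependency transfer relationship, whether the second network node is the network node that causes the first network node to fail;
if the determination result is yes, write the second network node into a fault root cause list, and then select the downstream network node that has a direct dependency transfer relationship with the second network node as the network node for the next round of operations;
if the determination result is no, select the downstream network node that has a direct dependency transfer relationship with the second network node as the network node for the next round of operations; and
when all of the other network nodes included in the dependency chain have been processed, locate the network nodes included in the fault root cause list as the network nodes that cause the first network node to fail;
where the most upstream network node in the dependency chain is the network node that, according to the dependency transfer relationships, depends on the other network nodes in the dependency chain.
Specifically, the locating unit 74 is configured to: when the working state of the second network node is an abnormal state, and it is further determined that the second network node has no downstream network node with a direct dependency transfer relationship, or that the suspicion degree value of its directly dependent downstream network node being the fault root cause of the first network node's failure is greater than a set first threshold, determine that the second network node is the network node that causes the first network node to fail.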
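Specifically, the locating unit 74 is configured to: if at least two dependency chains are determined, separately calculate, for each downstream network node that has a direct dependency transfer relationship with the first network node, the suspicion degree value of that downstream network node being the fault root cause of the first network node's failure;
select, from the calculated suspicion degree values, a suspicion degree value greater than a set second threshold;
determine, according to the selected suspicion degree value, the dependency chain that includes the first network node and the downstream network node corresponding to the selected suspicion degree value; and
based on the determined dependency chain, locate, according to the working state of each of the other network nodes included in the dependency chain, the network node that causes the first network node to fail from among those other network nodes.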
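Specifically, when calculating the suspicion degree value of the downstream network node that has a direct dependency transfer relationship with the first network node being the fault root cause of the first network node's failure, the locating unit 74 is configured to:
determine the total number of upstream network nodes that have a dependency transfer relationship with the downstream network node, and determine a first number, namely the number of those upstream network nodes whose working state is abnormal;
for the upstream network nodes whose working state is abnormal, count, according to the severity of the fault alarm sent by each of them, a second number, namely the number of network nodes belonging to the same fault alarm severity level; and
calculate the suspicion degree value according to the determined total number, first number, and second number.
Specifically, when calculating the suspicion degree value according to the determined total number, first number, and second number, the locating unit 74 is configured to:
calculate, in the following manner, the suspicion degree value of the downstream network node that has a direct dependency transfer relationship with the first network node being the fault root cause of the first network node's failure:
[Formula image: Figure PCTCN2016080096-appb-000005]
where S1i is the calculated suspicion degree value of the 1i-th downstream network node, which has a direct dependency transfer relationship with the first network node, being the fault root cause of the failure of the network node that sent the fault alarm; i ranges from 1 to N, where N is a natural number; m1i is the total number of upstream network nodes that have a dependency transfer relationship with the downstream network node; n1i is the first number, that is, the number of those upstream network nodes whose working state is abnormal; and w1i is the second number, that is, the number of network nodes belonging to the same fault alarm severity level.
It should be noted that the device provided in this embodiment of the present invention may be implemented in software or in hardware, which is not limited here.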
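FIG. 8 is a schematic structural diagram of a device for locating a fault according to an embodiment of the present invention. The device described in this embodiment may be implemented using a general-purpose computer architecture. For example, the device includes a signal receiver 81 and a processor 82, where the signal receiver 81 and the processor 82 may be connected through a bus 83.
Specifically, the signal receiver 81 is configured to receive at least one fault alarm, where each fault alarm includes an identifier of a first network node that sends the fault alarm and the alarm type of the fault alarm sent by the first network node, and the alarm type includes at least one of an application type, a link type, and a device type;
the processor 82 is configured to search, according to the identifier of the first network node and the alarm type that are included in the fault alarm, for a dependency transfer relationship that satisfies the dependency rule corresponding to the alarm type and that includes the first network node;
determine, according to the dependency transfer relationship, a dependency chain that includes the first network node and the other network nodes having a dependency transfer relationship with the first network node, where the dependency chain represents the dependency transfer relationships from the first network node to each of the other network nodes, and the dependency transfer relationship includes at least one of a connection relationship and a containment relationship; and
locate, according to the working state of each of the other network nodes included in the dependency chain, the network node that causes the first network node to fail from among those other network nodes.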
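Optionally, the dependency rule includes at least one of the following: the application type depends on the link type, and the link type depends on the device type.
In another embodiment of the present invention, that the processor 82 searches, according to the identifier of the first network node and the alarm type that are included in the fault alarm, for a dependency transfer relationship that satisfies the dependency rule corresponding to the alarm type and that includes the first network node includes:
determining, according to the alarm type included in the fault alarm, the dependency rule that satisfies the alarm type; and
searching, according to the dependency rule and the identifier of the first network node included in the fault alarm, for the dependency transfer relationship that includes the first network node.
In another embodiment of the present invention, that the processor 82 locates, according to the working state of each of the other network nodes included in the dependency chain, the network node that causes the first network node to fail from among those other network nodes includes:
performing the following operations in sequence, starting from the most upstream network node in the dependency chain, until all of the other network nodes included in the dependency chain have been processed:
determining a second network node on which the current round of operations is performed;
determining, according to the working state of the second network node, the working state of its downstream network node with which it has a direct dependency transfer relationship, and the working state of its upstream network node with which it has a direct dependency transfer relationship, whether the second network node is the network node that causes the first network node to fail;
if the determination result is yes, writing the second network node into a fault root cause list, and then selecting the downstream network node that has a direct dependency transfer relationship with the second network node as the network node for the next round of operations;
if the determination result is no, selecting the downstream network node that has a direct dependency transfer relationship with the second network node as the network node for the next round of operations; and
when all of the other network nodes included in the dependency chain have been processed, locating the network nodes included in the fault root cause list as the network nodes that cause the first network node to fail;
where the most upstream network node in the dependency chain is the network node that, according to the dependency transfer relationships, depends on the other network nodes in the dependency chain.
In another embodiment of the present invention, that the processor 82 determines, according to the working state of the second network node, the working state of its downstream network node with which it has a direct dependency transfer relationship, and the working state of its upstream network node with which it has a direct dependency transfer relationship, whether the second network node is the network node that causes the network node sending the fault alarm to fail includes:
when the working state of the second network node is an abnormal state, and it is further determined that the second network node has no downstream network node with a direct dependency transfer relationship, or that the suspicion degree value of its directly dependent downstream network node being the fault root cause of the first network node's failure is greater than a set first threshold, determining that the second network node is the network node that causes the first network node to fail.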
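In another embodiment of the present invention, that the processor 82 locates, according to the working state of each of the other network nodes included in the dependency chain, the network node that causes the first network node to fail from among those other network nodes includes:
if at least two dependency chains are determined, separately calculating, for each downstream network node that has a direct dependency transfer relationship with the first network node, the suspicion degree value of that downstream network node being the fault root cause of the first network node's failure;
selecting, from the calculated suspicion degree values, a suspicion degree value greater than a set second threshold;
determining, according to the selected suspicion degree value, the dependency chain that includes the first network node and the downstream network node corresponding to the selected suspicion degree value; and
based on the determined dependency chain, locating, according to the working state of each of the other network nodes included in the dependency chain, the network node that causes the first network node to fail from among those other network nodes.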
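In another embodiment of the present invention, that the processor 82 calculates the suspicion degree value of the downstream network node that has a direct dependency transfer relationship with the first network node being the fault root cause of the first network node's failure includes:
determining the total number of upstream network nodes that have a dependency transfer relationship with the downstream network node, and determining a first number, namely the number of those upstream network nodes whose working state is abnormal;
for the upstream network nodes whose working state is abnormal, counting, according to the severity of the fault alarm sent by each of them, a second number, namely the number of network nodes belonging to the same fault alarm severity level; and
calculating the suspicion degree value according to the determined total number, first number, and second number.
In another embodiment of the present invention, that the processor 82 calculates the suspicion degree value according to the determined total number, first number, and second number includes:
calculating, in the following manner, the suspicion degree value of the downstream network node that has a direct dependency transfer relationship with the first network node being the fault root cause of the first network node's failure:
[Formula image: Figure PCTCN2016080096-appb-000006]
where S1i is the calculated suspicion degree value of the 1i-th downstream network node, which has a direct dependency transfer relationship with the first network node, being the fault root cause of the failure of the network node that sent the fault alarm; i ranges from 1 to N, where N is a natural number; m1i is the total number of upstream network nodes that have a dependency transfer relationship with the downstream network node; n1i is the first number, that is, the number of those upstream network nodes whose working state is abnormal; and w1i is the second number, that is, the number of network nodes belonging to the same fault alarm severity level.
In the embodiments of the present invention, regardless of whether a local fault alarm or a global fault alarm occurs in the system, the device for locating a fault can determine, from the dependency transfer relationships between the network nodes and the type of the alarm, at least one dependency chain that includes the network node sending the fault alarm. Based on that dependency chain and the working states of the other network nodes in it, the device locates the network node that caused the fault from among those other network nodes. The fault root cause is thus located flexibly according to the dependency transfer relationships between network nodes, which avoids the low fault-locating efficiency caused by relying on preset fault rules and improves the efficiency of locating the fault root cause.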
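A person skilled in the art should understand that the embodiments of the present invention may be provided as a method, an apparatus (device), or a computer program product. Therefore, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) that contain computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, the apparatus (device), and the computer program product according to the embodiments of the present invention. It should be understood that computer program instructions can implement each process and/or block in the flowcharts and/or block diagrams, and combinations of processes and/or blocks in the flowcharts and/or block diagrams. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or another programmable data processing device to work in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture that includes an instruction apparatus, and the instruction apparatus implements the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operation steps are performed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device thus provide steps for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present invention have been described, a person skilled in the art can make further changes and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of the present invention.
Apparently, a person skilled in the art can make various modifications and variations to the present invention without departing from its spirit and scope. If these modifications and variations fall within the scope of the claims of the present invention and their equivalent technologies, the present invention is also intended to cover them.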

Claims (14)

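  1. A method for locating a fault, characterized by comprising:
    receiving at least one fault alarm, wherein each fault alarm includes an identifier of a first network node that sends the fault alarm and the alarm type of the fault alarm sent by the first network node, and the alarm type includes at least one of an application type, a link type, and a device type;
    searching, according to the identifier of the first network node and the alarm type that are included in the fault alarm, for a dependency transfer relationship that satisfies the dependency rule corresponding to the alarm type and that includes the first network node;
    determining, according to the dependency transfer relationship, a dependency chain that includes the first network node and the other network nodes having a dependency transfer relationship with the first network node, wherein the dependency chain represents the dependency transfer relationships from the first network node to each of the other network nodes, and the dependency transfer relationship includes at least one of a connection relationship and a containment relationship; and
    locating, according to the working state of each of the other network nodes included in the dependency chain, the network node that causes the first network node to fail from among those other network nodes.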
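  2. The method for locating a fault according to claim 1, characterized in that the dependency rule includes at least one of the following: the application type depends on the link type, and the link type depends on the device type;
    the searching, according to the identifier of the first network node and the alarm type that are included in the fault alarm, for a dependency transfer relationship that satisfies the dependency rule corresponding to the alarm type and that includes the first network node comprises:
    determining, according to the alarm type included in the fault alarm, the dependency rule that satisfies the alarm type; and
    searching, according to the dependency rule and the identifier of the first network node included in the fault alarm, for the dependency transfer relationship that includes the first network node.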
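  3. The method for locating a fault according to claim 1 or 2, characterized in that the locating, according to the working state of each of the other network nodes included in the dependency chain, the network node that causes the first network node to fail from among those other network nodes comprises:
    performing the following operations in sequence, starting from the most upstream network node in the dependency chain, until all of the other network nodes included in the dependency chain have been processed:
    determining a second network node on which the current round of operations is performed;
    determining, according to the working state of the second network node, the working state of its downstream network node with which it has a direct dependency transfer relationship, and the working state of its upstream network node with which it has a direct dependency transfer relationship, whether the second network node is the network node that causes the first network node to fail;
    if the determination result is yes, writing the second network node into a fault root cause list, and then selecting the downstream network node that has a direct dependency transfer relationship with the second network node as the network node for the next round of operations;
    if the determination result is no, selecting the downstream network node that has a direct dependency transfer relationship with the second network node as the network node for the next round of operations; and
    when all of the other network nodes included in the dependency chain have been processed, locating the network nodes included in the fault root cause list as the network nodes that cause the first network node to fail;
    wherein the most upstream network node in the dependency chain is the network node that, according to the dependency transfer relationships, depends on the other network nodes in the dependency chain.
  4. The method for locating a fault according to claim 3, characterized in that the determining, according to the working state of the second network node, the working state of its downstream network node with which it has a direct dependency transfer relationship, and the working state of its upstream network node with which it has a direct dependency transfer relationship, whether the second network node is the network node that causes the network node sending the fault alarm to fail comprises:
    when the working state of the second network node is an abnormal state, and it is further determined that the second network node has no downstream network node with a direct dependency transfer relationship, or that the suspicion degree value of its directly dependent downstream network node being the fault root cause of the first network node's failure is greater than a set first threshold, determining that the second network node is the network node that causes the first network node to fail.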
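  5. The method for locating a fault according to any one of claims 1 to 4, characterized in that the locating, according to the working state of each of the other network nodes included in the dependency chain, the network node that causes the first network node to fail from among those other network nodes comprises:
    if at least two dependency chains are determined, separately calculating, for each downstream network node that has a direct dependency transfer relationship with the first network node, the suspicion degree value of that downstream network node being the fault root cause of the first network node's failure;
    selecting, from the calculated suspicion degree values, a suspicion degree value greater than a set second threshold;
    determining, according to the selected suspicion degree value, the dependency chain that includes the first network node and the downstream network node corresponding to the selected suspicion degree value; and
    based on the determined dependency chain, locating, according to the working state of each of the other network nodes included in the dependency chain, the network node that causes the first network node to fail from among those other network nodes.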
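  6. The method for locating a fault according to claim 4 or 5, characterized in that calculating the suspicion degree value of the downstream network node that has a direct dependency transfer relationship with the first network node being the fault root cause of the first network node's failure comprises:
    determining the total number of upstream network nodes that have a dependency transfer relationship with the downstream network node, and determining a first number, namely the number of those upstream network nodes whose working state is abnormal;
    for the upstream network nodes whose working state is abnormal, counting, according to the severity of the fault alarm sent by each of them, a second number, namely the number of network nodes belonging to the same fault alarm severity level; and
    calculating the suspicion degree value according to the determined total number, first number, and second number.
  7. The method for locating a fault according to claim 6, characterized in that the calculating the suspicion degree value according to the determined total number, first number, and second number comprises:
    calculating, in the following manner, the suspicion degree value of the downstream network node that has a direct dependency transfer relationship with the first network node being the fault root cause of the first network node's failure:
    [Formula image: Figure PCTCN2016080096-appb-100001]
    wherein S1i is the calculated suspicion degree value of the 1i-th downstream network node, which has a direct dependency transfer relationship with the first network node, being the fault root cause of the failure of the network node that sent the fault alarm; i ranges from 1 to N, where N is a natural number; m1i is the total number of upstream network nodes that have a dependency transfer relationship with the downstream network node; n1i is the first number, that is, the number of those upstream network nodes whose working state is abnormal; and w1i is the second number, that is, the number of network nodes belonging to the same fault alarm severity level.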
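  8. A device for locating a fault, characterized by comprising:
    a receiving unit, configured to receive at least one fault alarm, wherein each fault alarm includes an identifier of a first network node that sends the fault alarm and the alarm type of the fault alarm sent by the first network node, and the alarm type includes at least one of an application type, a link type, and a device type;
    a searching unit, configured to search, according to the identifier of the first network node and the alarm type that are included in the fault alarm, for a dependency transfer relationship that satisfies the dependency rule corresponding to the alarm type and that includes the first network node;
    a determining unit, configured to determine, according to the dependency transfer relationship, a dependency chain that includes the first network node and the other network nodes having a dependency transfer relationship with the first network node, wherein the dependency chain represents the dependency transfer relationships from the first network node to each of the other network nodes, and the dependency transfer relationship includes at least one of a connection relationship and a containment relationship; and
    a locating unit, configured to locate, according to the working state of each of the other network nodes included in the dependency chain, the network node that causes the first network node to fail from among those other network nodes.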
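  9. The device for locating a fault according to claim 8, characterized in that the dependency rule includes at least one of the following: the application type depends on the link type, and the link type depends on the device type;
    the searching unit is specifically configured to determine, according to the alarm type included in the fault alarm, the dependency rule that satisfies the alarm type; and
    search, according to the dependency rule and the identifier of the first network node included in the fault alarm, for the dependency transfer relationship that includes the first network node.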
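  10. The device for locating a fault according to claim 8 or 9, characterized in that
    the locating unit is specifically configured to perform the following operations in sequence, starting from the most upstream network node in the dependency chain, until all of the other network nodes included in the dependency chain have been processed:
    determine a second network node on which the current round of operations is performed;
    determine, according to the working state of the second network node, the working state of its downstream network node with which it has a direct dependency transfer relationship, and the working state of its upstream network node with which it has a direct dependency transfer relationship, whether the second network node is the network node that causes the first network node to fail;
    if the determination result is yes, write the second network node into a fault root cause list, and then select the downstream network node that has a direct dependency transfer relationship with the second network node as the network node for the next round of operations;
    if the determination result is no, select the downstream network node that has a direct dependency transfer relationship with the second network node as the network node for the next round of operations; and
    when all of the other network nodes included in the dependency chain have been processed, locate the network nodes included in the fault root cause list as the network nodes that cause the first network node to fail;
    wherein the most upstream network node in the dependency chain is the network node that, according to the dependency transfer relationships, depends on the other network nodes in the dependency chain.
  11. The device for locating a fault according to claim 10, characterized in that
    the locating unit is specifically configured to: when the working state of the second network node is an abnormal state, and it is further determined that the second network node has no downstream network node with a direct dependency transfer relationship, or that the suspicion degree value of its directly dependent downstream network node being the fault root cause of the first network node's failure is greater than a set first threshold, determine that the second network node is the network node that causes the first network node to fail.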
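  12. The device for locating a fault according to any one of claims 8 to 11, characterized in that
    the locating unit is specifically configured to: if at least two dependency chains are determined, separately calculate, for each downstream network node that has a direct dependency transfer relationship with the first network node, the suspicion degree value of that downstream network node being the fault root cause of the first network node's failure;
    select, from the calculated suspicion degree values, a suspicion degree value greater than a set second threshold;
    determine, according to the selected suspicion degree value, the dependency chain that includes the first network node and the downstream network node corresponding to the selected suspicion degree value; and
    based on the determined dependency chain, locate, according to the working state of each of the other network nodes included in the dependency chain, the network node that causes the first network node to fail from among those other network nodes.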
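  13. The device for locating a fault according to claim 11 or 12, characterized in that, when calculating the suspicion degree value of the downstream network node that has a direct dependency transfer relationship with the first network node being the fault root cause of the first network node's failure, the locating unit is specifically configured to:
    determine the total number of upstream network nodes that have a dependency transfer relationship with the downstream network node, and determine a first number, namely the number of those upstream network nodes whose working state is abnormal;
    for the upstream network nodes whose working state is abnormal, count, according to the severity of the fault alarm sent by each of them, a second number, namely the number of network nodes belonging to the same fault alarm severity level; and
    calculate the suspicion degree value according to the determined total number, first number, and second number.
  14. The device for locating a fault according to claim 13, characterized in that, when calculating the suspicion degree value according to the determined total number, first number, and second number, the locating unit is specifically configured to:
    calculate, in the following manner, the suspicion degree value of the downstream network node that has a direct dependency transfer relationship with the first network node being the fault root cause of the first network node's failure:
    [Formula image: Figure PCTCN2016080096-appb-100002]
    wherein S1i is the calculated suspicion degree value of the 1i-th downstream network node, which has a direct dependency transfer relationship with the first network node, being the fault root cause of the failure of the network node that sent the fault alarm; i ranges from 1 to N, where N is a natural number; m1i is the total number of upstream network nodes that have a dependency transfer relationship with the downstream network node; n1i is the first number, that is, the number of those upstream network nodes whose working state is abnormal; and w1i is the second number, that is, the number of network nodes belonging to the same fault alarm severity level.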
PCT/CN2016/080096 2015-04-30 2016-04-23 Method and device for locating a fault WO2016173473A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510217812.6A CN106209400B (zh) 2015-04-30 2015-04-30 Method and device for locating a fault
CN201510217812.6 2015-04-30

Publications (1)

Publication Number Publication Date
WO2016173473A1 true WO2016173473A1 (zh) 2016-11-03

Family

ID=57199009

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/080096 WO2016173473A1 (zh) 2015-04-30 2016-04-23 Method and device for locating a fault

Country Status (2)

Country Link
CN (1) CN106209400B (zh)
WO (1) WO2016173473A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
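CN109905270A (zh) * 2018-03-29 2019-06-18 华为技术有限公司 Method, apparatus, and computer-readable storage medium for locating a root-cause alarm
CN114285725A (zh) * 2021-12-24 2022-04-05 中国电信股份有限公司 Network fault determination method and apparatus, storage medium, and electronic device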

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
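CN109257195B (zh) * 2017-07-12 2021-01-15 华为技术有限公司 Fault handling method and device for nodes in a cluster
CN108322351B (zh) * 2018-03-05 2021-09-10 北京奇艺世纪科技有限公司 Method and apparatus for generating a topology graph, and fault determination method and apparatus
CN110380878A (zh) * 2018-04-12 2019-10-25 阿里巴巴集团控股有限公司 Link inspection method and apparatus, and electronic device
CN109272651A (zh) * 2018-08-29 2019-01-25 北京华沁智联科技有限公司 Coordinate detection method, apparatus, and system for an automatic vending warehouse
CN109446291B (zh) * 2018-10-23 2022-05-13 山东中创软件商用中间件股份有限公司 Road network status statistics method, apparatus, and computer-readable storage medium
CN109828788A (zh) * 2018-12-21 2019-05-31 天翼电子商务有限公司 Rule engine acceleration method and system based on thread-level speculative execution
CN111786806B (zh) * 2019-04-04 2022-03-01 大唐移动通信设备有限公司 Network element exception handling method and network management system
CN110071828A (zh) * 2019-04-11 2019-07-30 中国移动通信集团内蒙古有限公司 Alarm method, apparatus, device, and storage medium
CN110661660B (zh) * 2019-09-25 2021-09-10 北京宝兰德软件股份有限公司 Alarm information root cause analysis method and apparatus
CN112087320B (zh) * 2020-08-16 2023-06-16 中信百信银行股份有限公司 Anomaly locating method and apparatus, electronic device, and readable storage medium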

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104009854A (zh) * 2013-02-21 2014-08-27 中兴通讯股份有限公司 Alarm processing method and apparatus, and alarm association information setting method
US20140267788A1 (en) * 2013-03-15 2014-09-18 General Instrument Corporation Method for identifying and prioritizing fault location in a cable plant
CN104219087A (zh) * 2014-08-08 2014-12-17 蓝盾信息安全技术有限公司 Fault locating method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
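CN100589418C (zh) * 2007-12-10 2010-02-10 中兴通讯股份有限公司 Method and system for generating alarm correlation rules
CN103023028B (zh) * 2012-12-17 2015-09-02 江苏省电力公司 Method for rapidly locating power grid faults based on an inter-entity dependency graph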

Also Published As

Publication number Publication date
CN106209400B (zh) 2018-12-07
CN106209400A (zh) 2016-12-07

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16785904

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16785904

Country of ref document: EP

Kind code of ref document: A1