WO2024169467A1 - Fault location method for a distributed network, network device, and storage medium - Google Patents

Info

Publication number
WO2024169467A1
Authority
WO
WIPO (PCT)
Prior art keywords
indicator
node
fault
relationship chain
target
Prior art date
Application number
PCT/CN2024/071154
Other languages
English (en)
French (fr)
Inventor
王太宝
屠要峰
徐进
王德政
洪科
Original Assignee
中兴通讯股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司
Publication of WO2024169467A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0677 Localisation of faults

Definitions

  • the present application belongs to the field of communications, and specifically relates to a distributed network fault location method, network equipment and storage medium.
  • the embodiments of the present application provide a distributed network fault location method, network device and storage medium, which can solve the problem of low efficiency of the related art relying on manual fault location.
  • a method for locating a fault in a distributed network comprises: determining an indicator node relationship chain based on a preset node information file, wherein the indicator node relationship chain comprises a plurality of indicator nodes and is used to reflect the association relationship between the indicator nodes; collecting indicator data of a target object, wherein the target object comprises the plurality of indicator nodes; determining an abnormal indicator node from the plurality of indicator nodes based on the indicator data; determining, based on the abnormal indicator node, a fault indicator node relationship chain from the indicator node relationship chain; and determining a fault root cause node from the fault indicator node relationship chain.
  • a network device comprising: a processor and a memory, wherein the memory stores a program, and when the program is executed, the method described in the first aspect is implemented.
  • an embodiment of the present application provides a readable storage medium, on which a program or instruction is stored, and when the program or instruction is executed, the steps of the method described in the first aspect are implemented.
  • FIG. 1 is a flow chart of a distributed network fault location method provided in an embodiment of the present application.
  • FIG. 2 is a flow chart of another distributed network fault location method provided in an embodiment of the present application.
  • FIG. 3-1 is a schematic diagram of an exemplary indicator node relationship chain provided in an embodiment of the present application.
  • FIG. 3-2 is a schematic diagram of an exemplary fault indicator node relationship chain provided in an embodiment of the present application.
  • FIG. 4-1 is a schematic diagram of another exemplary indicator node relationship chain provided in an embodiment of the present application.
  • FIG. 4-2 is a schematic diagram of another exemplary fault indicator node relationship chain provided in an embodiment of the present application.
  • FIG. 5 is a structural block diagram of a fault location device for a distributed network provided in an embodiment of the present application.
  • FIG. 6 is a schematic diagram of a network device provided in an embodiment of the present application.
  • the terms “first”, “second”, etc. in the specification and claims of the present application are used to distinguish similar objects, rather than to describe a specific order or sequence. It should be understood that terms used in this way can be interchanged where appropriate, so that the embodiments of the present application can be implemented in orders other than those shown or described herein.
  • the objects distinguished by “first”, “second”, etc. are usually of the same type, and the number of objects is not limited.
  • for example, there can be one or more first objects.
  • “and/or” in the specification and claims means at least one of the connected objects, and the character “/” generally means that the related objects are in an “or” relationship.
  • the distributed network fault location method provided in the embodiment of the present application addresses the distributed service fault handling approach of the related technology: it replaces manual step-by-step troubleshooting with automatic and accurate location, which reduces the complexity and time cost of fault location.
  • the entire method is executed by network equipment rather than manually, and the execution efficiency is greatly improved, thereby solving the problem of low efficiency in the related technology that relies on manual fault location.
  • self-healing processing can also be performed after fault location to improve the efficiency of fault resolution.
  • the embodiments of the present application may adopt a self-implemented association processing algorithm and a dynamic threshold algorithm.
  • the association processing algorithm may, for example, perform association location based on the assigned indicator ID, and may organize the originally independent services and indicators into an indicator node relationship chain.
  • the indicator abnormality threshold may be dynamically changed according to the dynamic threshold algorithm.
  • the dynamic threshold algorithm may be continuously trained and learned according to the configured default threshold to obtain a relatively reasonable indicator abnormality threshold. Then, based on the indicator node relationship chain obtained after the association processing, a fault indicator node relationship chain may be further obtained to screen out indicator nodes that may have faults.
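  • the application does not disclose the dynamic threshold algorithm itself. Purely as an illustration, the sketch below assumes an exponentially weighted moving average of observed indicator values, seeded from the configured default threshold. The class name, the smoothing factor `alpha`, and the `margin` multiplier are all assumptions and are not taken from the application.

```python
class DynamicThreshold:
    """Hypothetical threshold that drifts from a configured default toward observed values."""

    def __init__(self, default_threshold: float, alpha: float = 0.1, margin: float = 1.5):
        self.threshold = default_threshold       # start from the configured default
        self.alpha = alpha                       # smoothing factor (assumed)
        self.margin = margin                     # headroom above the running mean (assumed)
        self._mean = default_threshold / margin  # running mean implied by the default

    def update(self, value: float) -> float:
        """Fold one observation into the running mean and refresh the threshold."""
        self._mean = (1 - self.alpha) * self._mean + self.alpha * value
        self.threshold = self._mean * self.margin
        return self.threshold

    def is_abnormal(self, value: float) -> bool:
        """An indicator value is abnormal when it exceeds the current threshold."""
        return value > self.threshold
```

  • any other learning rule could be substituted here; the only property carried over from the text is that the threshold starts at the configured default and changes over time.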
  • a self-implemented self-healing processing algorithm may be adopted to promptly handle different types of abnormal faults, thereby effectively improving the fault recovery efficiency and user experience.
  • the distributed network fault location method provided in the embodiment of the present application is applied in a distributed network environment, wherein the distributed network includes multiple computers that can use the HTTP/HTTPS protocol.
  • the HTTP/HTTPS protocol is used for network data transmission and requests.
  • FIG. 1 is a flow chart of a distributed network fault location method provided in an embodiment of the present application.
  • the fault location method can be applied to network devices.
  • the distributed network fault location method provided in an embodiment of the present application may include the following steps.
  • Step 110: Based on the preset node information file, determine the indicator node relationship chain.
  • the indicator node relationship chain includes multiple indicator nodes, and the indicator node relationship chain is used to reflect the association relationship between the various indicator nodes.
  • the preset node information file may be any file that can be used to determine an indicator node relationship chain.
  • the node information file may include indicator identifiers corresponding to indicator nodes, and associations between indicator identifiers.
  • the indicator identifier may include an indicator name or an indicator ID, etc.
  • determining the indicator node relationship chain may include: determining the indicator node relationship chain based on the indicator identifiers in the node information file, and the associations between the indicator identifiers; wherein, an indicator identifier corresponds to an indicator node in the indicator node relationship chain.
  • the association between the indicator identifiers in the preset node information file can be directly used to quickly construct an indicator node relationship chain.
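  • a minimal sketch of this construction follows. It assumes the node information file has already been parsed into a dictionary holding a list of indicator identifiers and a list of (downstream, upstream) associations; that layout and all names are hypothetical.

```python
from collections import defaultdict

# Hypothetical parsed node information file: indicator identifiers plus
# (downstream, upstream) associations between them.
node_info = {
    "indicators": ["x", "a", "b"],
    "associations": [("x", "a"), ("x", "b")],
}


def build_relationship_chain(info):
    """Map each indicator identifier to the list of its upstream identifiers."""
    upstream = defaultdict(list)
    for indicator in info["indicators"]:
        upstream[indicator]  # touching the key creates a node for every identifier
    for downstream, up in info["associations"]:
        upstream[downstream].append(up)
    return dict(upstream)


chain = build_relationship_chain(node_info)
```

  • each key of `chain` stands for one indicator node, and an empty list marks a node with no configured upstream indicator.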
  • the preset node information file may also directly include an indicator node relationship chain, so that the indicator node relationship chain in this node information file can be directly obtained.
  • the preset node information file may be one file or may include multiple files.
  • the preset node information file may be, for example, a node configuration file for constructing an indicator node relationship chain.
  • the node configuration file may include indicator identifiers corresponding to indicator nodes and the association relationship between indicator identifiers.
  • the preset node information file may include, for example, a node configuration file and a service indicator configuration file.
  • the service indicator configuration file may include a service name and an upstream and downstream association relationship between each service name, and each service name corresponds to at least one indicator identifier.
  • the node configuration file may include a connection relationship between indicator identifiers corresponding to two adjacent service names having the upstream and downstream association relationship. Accordingly, in the process of determining the indicator node relationship chain based on the indicator identifiers in the node information file and the association relationship between the indicator identifiers, the indicator node relationship chain may be formed according to the connection relationship in the node configuration file. In this way, the node configuration file and the service indicator configuration file may be associated with each other through the upstream and downstream association relationship between the service names as a link, and then the indicator node relationship chain is formed based on the connection relationship in the node configuration file.
  • This method of determining the indicator node relationship chain through multiple files allows each file to focus on its own function, and can facilitate searching and obtaining the data in each file according to function while forming the indicator node relationship chain.
  • the upstream and downstream association relationship in the embodiment of the present application can represent a specific association between two adjacent objects. Taking the adjacent first object and second object as an example, if the upstream object of the first object is the second object, it can be said that the second object causes the abnormality of the first object. In this case, the first object is the downstream object of the second object.
  • the node information file may include, for example, a node configuration file and a service indicator configuration file.
  • the node configuration file includes a first service identifier and an indicator identifier that is mapped to the first service identifier, and at least one target indicator identifier exists in the indicator identifier.
  • the service indicator configuration file includes a second service identifier that is associated with the target indicator identifier. Wherein, both the first service identifier and the second service identifier may include a service name.
  • the determining of the indicator node relationship chain based on the indicator identifier in the node information file and the associated relationship between indicator identifiers may include: obtaining the first service identifier in the node configuration file and the indicator identifier that is mapped to the first service identifier; determining the second service identifier corresponding to the target indicator identifier in the indicator identifier from the service indicator configuration file; determining the indicator node relationship chain based on the first service identifier, the indicator identifier that is mapped to the first service identifier, and the second service identifier corresponding to the target indicator identifier; wherein, a first service identifier and a second service identifier each correspond to a node in the indicator node relationship chain.
  • the node configuration file and the service indicator configuration file may be associated with each other through the target indicator identifier as a link, thereby forming the indicator node relationship chain.
  • This method of determining the indicator node relationship chain through multiple files allows each file to focus on its own function, and can facilitate searching and obtaining data in each file according to function while forming the indicator node relationship chain.
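  • the two-file linkage can be sketched as follows, assuming the node configuration file maps a first service identifier to its indicator identifiers and the service indicator configuration file maps each target indicator identifier to a second service identifier. The dictionary layout and the service and indicator names are illustrative assumptions.

```python
# Hypothetical parsed file contents.
node_config = {"service-A": ["1001", "1002", "1003"]}  # first service id -> indicator ids
service_indicator_config = {"1001": "service-B", "1003": "service-C"}  # target indicator id -> second service id


def link_files(node_cfg, service_cfg):
    """Return (node, upstream node) edges that join both files via target indicator ids."""
    edges = []
    for service, indicator_ids in node_cfg.items():
        for indicator_id in indicator_ids:
            edges.append((service, indicator_id))
            upstream_service = service_cfg.get(indicator_id)
            if upstream_service is not None:  # indicator_id is a target indicator identifier
                edges.append((indicator_id, upstream_service))
    return edges


edges = link_files(node_config, service_indicator_config)
```

  • indicator 1002 has no entry in the second file, so it contributes only the edge from its own service and ends its branch of the chain.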
  • Step 120: Collect indicator data of a target object, wherein the target object includes the multiple indicator nodes.
  • the target object may be the multiple indicator nodes.
  • the target object may also include other indicator nodes in addition to the multiple indicator nodes.
  • the indicator data may include the indicator value of the indicator node.
  • the indicator data in the embodiment of the present application may also include at least one of the following: an indicator ID and an indicator name.
  • the indicator ID may be in the form of a number, such as 1001.
  • the indicator data of the target object can be collected based on a preset node information file.
  • the preset node information file may contain indicator collection information, and the indicator collection information may include a description of the indicator, a service name where the indicator is located, an indicator ID, a request method for indicator collection, and request parameters. In this way, specific indicator data under a specific service instance can be accurately obtained.
  • Step 130: Based on the indicator data, determine an abnormal indicator node from the plurality of indicator nodes.
  • the indicator data may include the indicator values of the multiple indicator nodes.
  • the node information file may also include an indicator abnormality threshold.
  • determining abnormal indicator nodes from the multiple indicator nodes in step 130 may include: identifying, among the multiple indicator nodes, each indicator node whose indicator value is greater than the indicator abnormality threshold, and determining that indicator node as an abnormal indicator node. In this way, by comparing the indicator value of the indicator node in the collected indicator data with the indicator abnormality threshold contained in the node information file, the abnormal indicator node can be quickly determined.
  • the indicator abnormality threshold may be one or more.
  • the indicator value of each indicator node may be compared with the indicator abnormality threshold.
  • the number of indicator abnormality thresholds may be the same as the number of the multiple indicator nodes. That is, the multiple indicator nodes include N indicator nodes, N is a positive integer greater than 1, the indicator abnormality threshold includes N thresholds, and the N indicator nodes correspond to the N thresholds one by one.
  • for each indicator node, it is determined whether the indicator value of the indicator node is greater than the indicator abnormality threshold corresponding to that indicator node. Because each indicator abnormality threshold corresponds to one indicator node, the thresholds of different indicator nodes can differ, which makes the identification of abnormal indicator nodes more accurate. It should be understood that the indicator abnormality threshold in the embodiment of the present application can be updated periodically; in other words, all or part of the N thresholds can be updated periodically.
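  • the per-node comparison can be sketched as below. The indicator names, values, and thresholds are made-up illustrations; only the "greater than" rule comes from the text.

```python
def find_abnormal_nodes(indicator_values, thresholds):
    """Return the indicator nodes whose value exceeds their own abnormality threshold."""
    return [
        node
        for node, value in indicator_values.items()
        if value > thresholds[node]
    ]


# Hypothetical collected values and one threshold per indicator node.
indicator_values = {"index-a": 95.0, "index-b": 10.0, "index-c": 40.0}
thresholds = {"index-a": 90.0, "index-b": 50.0, "index-c": 40.0}
abnormal = find_abnormal_nodes(indicator_values, thresholds)
```

  • a value equal to its threshold (index-c here) is not flagged, because the comparison is strictly "greater than".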
  • the indicator abnormality threshold may be present in the service indicator configuration file.
  • the indicator repair information mentioned later may also be present in the service indicator configuration file.
  • Step 140: Based on the abnormal indicator node, determine the fault indicator node relationship chain from the indicator node relationship chain.
  • each abnormal indicator node in the indicator node relationship chain may be connected to each other to form a fault indicator node relationship chain.
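  • one way to picture this step is to keep only the abnormal nodes and the links between them. The sketch assumes the indicator node relationship chain is stored as a mapping from each node to its upstream nodes; that representation is an assumption, not something the application prescribes.

```python
def fault_relationship_chain(chain, abnormal):
    """Keep only abnormal nodes and the upstream links that connect them."""
    abnormal = set(abnormal)
    return {
        node: [up for up in upstream if up in abnormal]
        for node, upstream in chain.items()
        if node in abnormal
    }


# Hypothetical chain: node -> upstream nodes.
chain = {"x": ["a", "b"], "a": ["g"], "b": [], "g": []}
fault_chain = fault_relationship_chain(chain, ["x", "a", "g"])
```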
  • Step 150: Determine the fault root cause node from the fault indicator node relationship chain.
  • determining the fault root cause node from the fault indicator node relationship chain in step 150 may include: taking the target indicator node in the fault indicator node relationship chain that does not have an upstream indicator node as the fault root cause node; wherein the upstream indicator node includes the node in the fault indicator node relationship chain that causes the target indicator node to be abnormal. In this way, by determining whether there is an upstream indicator node, the fault root cause node can be quickly determined from the fault indicator node relationship chain.
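  • the "no upstream indicator node" rule can be sketched as a simple filter, again assuming the fault indicator node relationship chain is stored as a mapping from each node to its upstream nodes within that chain (a hypothetical representation).

```python
def find_root_cause_nodes(fault_chain):
    """Return the nodes of the fault chain that have no upstream node inside the chain."""
    return [node for node, upstream in fault_chain.items() if not upstream]


# Hypothetical fault chain: x is caused by a, and a is caused by g.
fault_chain = {"x": ["a"], "a": ["g"], "g": []}
root_causes = find_root_cause_nodes(fault_chain)
```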
  • the embodiments of the present application do not limit the process of determining the target indicator node.
  • a first designated node in the fault indicator node relationship chain can be obtained, the first designated node being located at the end of the fault indicator node relationship chain; the target indicator node is then determined starting from the first designated node, wherein there is no upstream node in the fault indicator node relationship chain that has a mapping relationship with the target indicator node; the upstream node includes the node that causes the target indicator node to be abnormal.
  • the target indicator node is determined as the fault root cause node.
  • an indicator node relationship chain is introduced, and a fault indicator node relationship chain is determined based on the indicator node relationship chain, thereby determining the fault root cause node; on the one hand, since the indicator node relationship chain is used to reflect the association between each node, the fault root cause node can be quickly determined from the fault indicator node relationship chain obtained based on the indicator node relationship chain; on the other hand, the entire method is executed by network equipment rather than manually, and the execution efficiency is greatly improved, thereby solving the problem of low efficiency in the related technology that relies on manual fault location.
  • the indicator node relationship chain and the fault indicator node relationship chain are further discussed below in conjunction with specific schematic diagrams. It should be understood that the indicator node relationship chain in the embodiment of the present application may include nodes corresponding to services, or may not include nodes corresponding to services, that is, only include nodes corresponding to indicator identifiers.
  • when the indicator node relationship chain in the embodiment of the present application only includes nodes corresponding to indicator identifiers, an example of the indicator node relationship chain is shown in Figure 3-1.
  • indicators a-d, indicators f-g, indicators j-k, indicator m and indicator x can all correspond to indicator identifiers.
  • the upstream indicator nodes of indicator x are indicator a, indicator b and indicator c.
  • the upstream indicator nodes of indicator a are indicator d, indicator f and indicator g.
  • the upstream indicator nodes of indicator c are indicator j, indicator k and indicator m.
  • a fault indicator node relationship chain, shown in bold in Figure 3-2, can be obtained; that is, indicator x, indicator a, and indicator g form a fault indicator node relationship chain.
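  • the Figure 3-1 and Figure 3-2 example can be traced in code. The upstream relations are copied from the description above; the set of abnormal nodes is taken to be the bold nodes x, a and g, and the traversal function is an illustrative assumption about how the chain might be walked.

```python
# Upstream relations as described for Figure 3-1.
chain = {
    "x": ["a", "b", "c"],
    "a": ["d", "f", "g"],
    "c": ["j", "k", "m"],
}
# Abnormal nodes matching the bold part of Figure 3-2.
abnormal = {"x", "a", "g"}


def fault_path(chain, abnormal, start):
    """Follow abnormal upstream nodes from `start` until none remains."""
    path = [start]
    while True:
        candidates = [
            up for up in chain.get(path[-1], [])
            if up in abnormal and up not in path
        ]
        if not candidates:
            return path
        path.append(candidates[0])  # assumes a single abnormal upstream per node


path = fault_path(chain, abnormal, "x")
root_cause = path[-1]
```

  • starting from indicator x, the walk visits indicator a and stops at indicator g, which has no abnormal upstream node and is therefore the fault root cause node in this example.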
  • when the indicator node relationship chain also includes nodes corresponding to services, an example of the indicator node relationship chain is shown in Figure 4-1.
  • indicators a-d, indicators f-g, indicators j-k and indicator m can all correspond to indicator identifiers.
  • the indicators collected under service A are indicator a, indicator b and indicator c.
  • the indicators collected under service B are indicator d, indicator f and indicator g.
  • the indicators collected under service C are indicator j, indicator k and indicator m.
  • the upstream node of indicator a is service B, and the upstream node of indicator c is service C.
  • a fault indicator node relationship chain can be obtained as shown in the bold part of Figure 4-2, that is, service A, indicator a, service B and indicator g form a fault indicator node relationship chain.
  • the fault location method provided by the embodiment of the present application can not only determine the root cause node of the fault, but also perform repair processing on the fault.
  • the node information file may include: indicator repair information of the root cause node of the fault.
  • the fault location method provided by the embodiment of the present application may also include: performing fault repair on the root cause node of the fault using the indicator repair information of the root cause node of the fault.
  • the fault location method may also include: when none of the target conditions is met, re-determining the fault root cause node and repairing the re-determined fault root cause node, until at least one of the target conditions is met; wherein the target conditions include: the indicator repair information of the fault root cause node is empty; the repair of the fault root cause node fails; or, after the repair of the fault root cause node, all nodes in the fault indicator node relationship chain return to normal. The process of re-determining the fault root cause node can refer to the previous description.
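  • the loop over the three target conditions can be sketched as follows. The callables `locate_root_cause`, `repair` and `all_normal`, the status strings, and the `max_rounds` safety bound are hypothetical hooks introduced for illustration; the application only states the stopping conditions.

```python
def self_heal(locate_root_cause, repair_info, repair, all_normal, max_rounds=10):
    """Repeat locate-and-repair until one of the three target conditions holds.

    Returns a (status, node) tuple describing why the loop stopped.
    """
    for _ in range(max_rounds):
        root = locate_root_cause()
        info = repair_info.get(root)
        if not info:                # condition 1: repair information is empty
            return ("alarm_no_repair_info", root)
        if not repair(root, info):  # condition 2: the repair fails
            return ("alarm_repair_failed", root)
        if all_normal():            # condition 3: every node in the fault chain recovered
            return ("recovered", root)
    return ("gave_up", None)        # assumed safety bound, not part of the application
```

  • in the first two outcomes a caller would typically raise the alarm prompt described in the text, since automatic repair cannot continue.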
  • alarm prompt information can be issued to instruct manual repair of the node that failed to be repaired.
  • the alarm prompt information can be in the form of text, image, or sound.
  • after the fault root cause node is repaired using the indicator repair information of the fault root cause node, in some cases no fault remains in the distributed network, and there is no need to re-determine the fault root cause node.
  • the embodiment of the present application can repair only the root cause node of the fault, or can repair the fault indicator nodes other than the root cause node in the fault indicator node relationship chain.
  • the following discusses the method of repairing one or more fault indicator nodes in the fault indicator node relationship chain.
  • the node information file may include: indicator repair information of each indicator node.
  • starting from the first target node in the fault indicator node relationship chain and moving toward the second target node, the indicator repair information of each node is obtained in sequence; the first target node is the fault root cause node in the fault indicator node relationship chain, and the first target node and the second target node are respectively located at the two ends of the fault indicator node relationship chain; using the indicator repair information of each node, each node is repaired in sequence starting from the first target node.
  • the process of repairing each node in sequence starting from the first target node using the indicator repair information of each node may include: repairing the third target node using the indicator repair information of the third target node; if the third target node is successfully repaired, determining a fourth target node associated with the third target node; repairing the fourth target node using the indicator repair information of the fourth target node; wherein the third target node and the fourth target node are two adjacent nodes in the fault indicator node relationship chain.
  • the fault repair can be performed by first repairing the upstream node, which can further improve the success rate of the repair.
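  • a minimal sketch of this upstream-first repair order follows. It assumes the fault indicator node relationship chain has been flattened into a list that starts at the fault root cause node, and that a hypothetical `repair` callable reports success as a boolean. Stopping when repair information is missing is an added assumption; the text only conditions the next repair on the previous one succeeding.

```python
def repair_in_sequence(fault_path, repair_info, repair):
    """Repair nodes from the root cause end toward the other end of the chain.

    Stops at the first node that has no repair information or whose repair fails,
    and returns (repaired_nodes, failed_node_or_None).
    """
    repaired = []
    for node in fault_path:
        info = repair_info.get(node)
        if not info or not repair(node, info):
            return repaired, node
        repaired.append(node)
    return repaired, None
```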
  • the third target node can be a fault root cause node in the fault indicator node relationship chain, or it can be a node in the fault indicator node relationship chain other than the fault root cause node.
  • the abnormality of the fourth target node may include an abnormality caused by the third target node.
  • after the third target node is successfully repaired, the abnormality of the fourth target node may or may not disappear.
  • the abnormality of the fourth target node may include, in addition to the abnormality caused by the third target node, abnormalities caused by other factors, such as external environmental factors. In other words, the successful repair of the third target node does not mean that the abnormality of the fourth target node will definitely disappear.
  • in either case, the indicator repair information of the fourth target node can be used directly for repair. In this way, it can be ensured that the repair is performed in a unified repair method, thereby improving the repair efficiency.
  • FIG. 2 is a flow chart of another distributed network fault location method provided in an embodiment of the present application.
  • the fault location method can be applied to network devices.
  • the distributed network fault location method provided in an embodiment of the present application may include the following steps.
  • Step 210: Based on a preset node information file, determine an indicator node relationship chain, wherein the indicator node relationship chain includes a plurality of indicator nodes, and the indicator node relationship chain is used to reflect the association relationship between the various indicator nodes.
  • Step 220: Collect indicator data of a target object, wherein the target object includes the multiple indicator nodes.
  • Step 230: Based on the indicator data, determine an abnormal indicator node from the multiple indicator nodes.
  • Step 240: Based on the abnormal indicator node, determine the fault indicator node relationship chain from the indicator node relationship chain.
  • Step 250: Determine the fault root cause node from the fault indicator node relationship chain.
  • Step 260: Repair the fault root cause node by using the indicator repair information of the fault root cause node.
  • Step 270: When none of the target conditions is met, re-determine the fault root cause node and repair the re-determined fault root cause node, until at least one of the target conditions is met.
  • the target conditions include: the indicator repair information of the fault root cause node is empty; the repair of the fault root cause node fails; or, after the fault root cause node is repaired, all nodes in the fault indicator node relationship chain return to normal.
  • step 210 to step 250 may refer to the above description.
  • after step 250, instead of performing step 260 and step 270, the following process may be performed: starting from the first target node in the fault indicator node relationship chain and moving toward the second target node, the indicator repair information of each node is obtained in sequence; the first target node is the fault root cause node in the fault indicator node relationship chain, and the first target node and the second target node are respectively located at the two ends of the fault indicator node relationship chain; using the indicator repair information of each node, each node is repaired in sequence starting from the first target node.
  • using the indicator repair information of each node to repair each node in sequence starting from the first target node includes: using the indicator repair information of the third target node to repair the third target node; if the third target node is successfully repaired, determining the fourth target node that has an associated relationship with the third target node; using the indicator repair information of the fourth target node to repair the fourth target node; wherein the third target node and the fourth target node are two adjacent nodes in the fault indicator node relationship chain.
  • each time, either only the determined fault root cause node can be repaired, or multiple nodes including the fault root cause node can be repaired.
  • the fault root cause node is determined; on the one hand, since the indicator node relationship chain is used to reflect the association relationship between each node, the fault root cause node can be quickly determined from the fault indicator node relationship chain obtained based on the indicator node relationship chain; on the other hand, the entire method is executed by the network device rather than manually, and the execution efficiency is greatly improved, thereby solving the problem of low efficiency in the related technology that relies on manual fault location.
  • the fault can be quickly located, and the self-healing processing of the fault can be realized, which greatly reduces or even eliminates the fault repair operation performed by human intervention, and improves work efficiency.
  • the following uses the preset node information file including the node configuration file and the service indicator configuration file as an example to further discuss the fault location method of the distributed network provided by the embodiment of the present application. It should be understood that the content described below is only an example and not a limitation.
  • the process of the distributed network fault location method provided in the embodiment of the present application can be as follows.
  • the service indicator configuration file may contain the types of abnormalities existing in the current service. Each type may contain multiple pieces of indicator information, and each piece of indicator information may include an indicator ID (unique and non-repetitive), an indicator abnormality threshold, the name of the upstream service causing the indicator abnormality, and indicator repair information.
  • An example of specific related content can be shown below.
  • if the current indicator 1000 fails, the related indicators of its dependent service hbase are checked and repaired first. Once the hbase fault indicator has been repaired, the current repair steps of indicator 1000 are performed. If the repair suggestion for indicator 1002 is null (empty information, indicating that there are currently no recommended steps for automatic repair), an alarm is sent to the platform for further confirmation and analysis by the operation and maintenance personnel. If the associated service information is empty, the indicator is already the final root cause indicator, and its current repair steps are then checked; if the repair steps are null, an alarm is sent to the platform for further analysis.
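  • the original configuration listing is not reproduced in this text, so the sketch below uses a made-up stand-in. The field names (`threshold`, `upstream_service`, `repair`), the threshold values, the repair string, and the assumption that indicator 1002 has no upstream service are all hypothetical; only the indicator IDs 1000 and 1002, the hbase dependency, and the decision rules come from the description above.

```python
# Hypothetical service indicator configuration, keyed by indicator ID.
service_indicator_config = {
    "1000": {"threshold": 80, "upstream_service": "hbase", "repair": "restart-service"},
    "1002": {"threshold": 5, "upstream_service": None, "repair": None},
}


def next_action(indicator_id, config):
    """Decide what to do for a failed indicator, following the rules described above."""
    entry = config[indicator_id]
    if entry["upstream_service"]:
        # A dependent service exists: check and repair its indicators first.
        return ("check_upstream", entry["upstream_service"])
    if entry["repair"] is None:
        # Final root cause indicator with no automatic repair steps: raise an alarm.
        return ("alarm", indicator_id)
    return ("repair", entry["repair"])
```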
  • the node configuration file of the distributed platform may include the access Access information, indicator list (all indicators can be summarized in the service indicator configuration file).
  • Each element in the list represents a specific indicator collection information, including the description of the indicator, the service name where the indicator is located, the indicator ID (this indicator ID can be the same as the indicator ID contained in the service indicator configuration file, so as to be used for associated processing logic), the request method and request parameters for indicator collection (the request parameters include the cluster name, host name, and service instance name to be used to accurately identify the specific indicator under the specific service instance).
  • the indicator in this example is 1001, which comes from the service service-A; the type, url, and requestBody content represent the collection information required to access the indicator. Based on this information, specific indicator values can be collected and compared with the indicator abnormality threshold in the service indicator configuration file to determine whether the indicator is abnormal.
  • after the configuration files are configured, they can be deployed on network devices such as platform control nodes.
  • the configuration files are loaded, and after loading, indicators can be collected according to the indicator information in the configuration files (collection can be scheduled, for example with a default interval of 1 hour), and the format of the collected indicator data can be as follows.
  • the indicators collected under the Server-A service are index-a, index-b, and index-c, together with the corresponding indicator IDs and collected indicator values.
  • after the indicator data is collected, the association processing algorithm logic can be called first: the upstream service name that caused the indicator abnormality, as configured in the service indicator configuration file, is used to obtain the previous node in the relationship chain.
  • for example, if the upstream service of the index-a indicator of the Server-A service is Server-B and the upstream service of the index-c indicator is Server-C, an indicator node relationship chain as shown in Figure 4-1 can be obtained.
  • the service nodes shown in Figure 4-1 namely the service A node, the service B node, and the service C node, are only an example. In an embodiment of the present application, the service nodes may not exist, but the indicator node relationship chain is presented in the form of indicator nodes.
  • the indicator abnormal threshold can be initialized and configured first, and then the threshold can be trained and learned with the default value when the scheduled task is executed next time, and a new abnormal threshold can be updated. In other words, the indicator abnormal threshold can be a dynamically adjusted threshold.
  • the self-healing processing algorithm can be used to find the indicator repair information of the root cause node (for example, indicator g) of the fault from the service indicator configuration file. If the indicator repair information corresponding to the root cause node of the fault is empty, it means that the indicator cannot be automatically repaired, and an alarm prompt message will be generated and sent to the platform, which will be handed over to the operation and maintenance personnel for further confirmation and analysis. If the indicator repair information is not empty, the corresponding repair operation is matched in the self-healing processing logic.
  • the fault indicator node relationship chain can be checked regularly. If all nodes in the fault indicator node relationship chain have been restored, no processing is required. If one of the nodes is still faulty, the fault root cause node can be re-determined when none of the target conditions is met, and the re-determined fault root cause node can be repaired until at least one of the target conditions is met; wherein the target conditions include: the indicator repair information of the fault root cause node is empty, the repair of the fault root cause node fails, or, after the fault root cause node is repaired, all nodes in the fault indicator node relationship chain return to normal.
  • an alarm prompt message may be generated. After the operation and maintenance personnel handle the alarm prompt message, the entire fault location and self-healing process may be terminated.
  • the distributed network fault location method provided in the embodiment of the present application has the following beneficial effects: the fault information in the distributed network is associated, so that the root cause of the fault can be automatically identified through the fault association relationship; after the fault location is confirmed, self-healing repair operations are performed according to the loaded configuration file to minimize the steps of human intervention; horizontal expansion is supported, that is, after adding a new service, you only need to configure the corresponding configuration file, and reload it after confirmation by the operation and maintenance personnel to take effect.
  • Fig. 5 is a structural block diagram of a distributed network fault location device provided in an embodiment of the present application.
  • the distributed network fault location device provided in an embodiment of the present application may include: a determination module 510 and a collection module 520 .
  • the determination module 510 is used to determine an indicator node relationship chain based on a preset node information file; wherein the indicator node relationship chain includes a plurality of indicator nodes, and the indicator node relationship chain is used to reflect the association relationship between the various indicator nodes.
  • the collection module 520 is used to collect the indicator data of the target object; the target object includes the multiple indicator nodes.
  • the determination module 510 is further used to determine an abnormal indicator node from the multiple indicator nodes based on the indicator data; determine a fault indicator node relationship chain from the indicator node relationship chain based on the abnormal indicator node; and determine a fault root cause node from the fault indicator node relationship chain.
  • the fault root cause node can be quickly determined from the fault indicator node relationship chain obtained based on the indicator node relationship chain; on the other hand, the whole process is executed by the fault locating device instead of manual work, and the execution efficiency is greatly improved, thereby solving the problem of low efficiency of the related technology relying on manual fault locating.
  • the node information file includes indicator identifiers corresponding to the indicator nodes and the association relationship between the indicator identifiers.
  • the determination module 510 is specifically used to: determine the indicator node relationship chain based on the indicator identifiers in the node information file and the association relationship between the indicator identifiers; wherein one indicator identifier corresponds to one indicator node in the indicator node relationship chain.
  • the node information file includes a node configuration file and a service indicator configuration file;
  • the service indicator configuration file includes a service name, and an upstream and downstream association relationship between each service name, and each service name corresponds to at least one indicator identifier;
  • the node configuration file includes a connection relationship between indicator identifiers corresponding to two adjacent service names having the upstream and downstream association relationship.
  • the determination module 510 is specifically used to: form the indicator node relationship chain according to the connection relationship in the node configuration file.
  • the indicator data includes the indicator values of the multiple indicator nodes
  • the node information file also includes: an indicator abnormality threshold corresponding to each indicator node; wherein the multiple indicator nodes include N indicator nodes, N is a positive integer greater than 1, and the indicator abnormality threshold includes N thresholds, and the N indicator nodes correspond to the N thresholds one by one.
  • the determination module 510 is specifically used to: for each indicator node in the multiple indicator nodes, determine whether the indicator value of the indicator node is greater than the indicator abnormality threshold corresponding to the indicator node; determine the indicator node whose indicator value is greater than the indicator abnormality threshold corresponding to the indicator node as an abnormal indicator node.
  • the determination module 510 in the process of determining the fault root cause node from the fault indicator node relationship chain, is specifically used to: use the target indicator node in the fault indicator node relationship chain that does not have an upstream indicator node as the fault root cause node.
  • the upstream indicator node includes the node in the fault indicator node relationship chain that causes the target indicator node to be abnormal.
  • the node information file also includes: indicator repair information of the fault root cause node.
  • the distributed network fault location device provided by the embodiment of the present application also includes: a repair module, which is used to repair the fault root cause node using the indicator repair information of the fault root cause node after the determination module 510 determines the fault root cause node from the fault indicator node relationship chain.
  • the determination module 510 is also used to: when any target condition is not met, redetermine the fault root cause node, and repair the redetermined fault root cause node until at least any one of the target conditions is met; wherein the target conditions include: the indicator repair information of the fault root cause node is empty, the repair of the fault root cause node fails, or, after the fault root cause node is repaired, all nodes in the fault indicator node relationship chain return to normal.
  • the node information file also includes: indicator repair information of each indicator node.
  • the fault location device of the distributed network provided by the embodiment of the present application also includes: an acquisition module, which is used to sequentially acquire the indicator repair information of each node from the first target node in the fault indicator node relationship chain toward the second target node; the first target node is the fault root cause node in the fault indicator node relationship chain, and the first target node and the second target node are respectively located at the two ends of the fault indicator node relationship chain; a repair module, which is used to use the indicator repair information of each node to repair each node sequentially starting from the first target node.
  • in the process of using the indicator repair information of each node to repair each node in turn starting from the first target node, the repair module is specifically used to: repair the third target node using the indicator repair information of the third target node; if the third target node is repaired successfully, determine the fourth target node that has an association relationship with the third target node; repair the fourth target node using the indicator repair information of the fourth target node; wherein the third target node and the fourth target node are two adjacent nodes in the fault indicator node relationship chain.
  • the distributed network fault location device provided in the embodiment of the present application corresponds to the distributed network fault location method mentioned above.
  • the relevant contents can refer to the description of the method above, which will not be repeated here.
  • the embodiment of the present application further provides a network device 600, which may be various types of computers, etc.
  • the network device 600 includes: a processor 610 and a memory 620, wherein the memory 620 stores programs or instructions, and when the programs or instructions are executed by the processor 610, the steps of any of the methods described above are implemented.
  • An embodiment of the present application further provides a readable storage medium, on which a program or instruction is stored.
  • the hierarchy of the program or instruction may be as follows: the first layer is the application service layer, which mainly provides external use functions; the second layer is the business service layer, which mainly performs internal logic calls. In the embodiment of the present application, it is mainly divided into the following three parts: association processing algorithm logic module, dynamic threshold algorithm logic module, and self-healing processing logic module; the third layer is the platform service layer, which mainly includes some logic calls provided by the platform.
  • the platform provides a query interface to query each service indicator information, and the platform assigns a unique service ID to each service indicator;
  • the fourth layer is the basic service layer, which mainly provides the underlying basic services required by the distributed system involved in the embodiment of the present application as support, such as the distributed framework Spring Boot, Spring Cloud and other distributed frameworks and communication frameworks.
  • the embodiments of the present invention may be provided as methods, systems, or computer program products.
  • the present invention may take the form of an all-hardware embodiment, an all-software embodiment, or an embodiment combining software and hardware aspects.
  • the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
  • each process and/or box in the flowchart and/or block diagram, as well as the combination of the process and/or box in the flowchart and/or block diagram can be implemented by computer program instructions.
  • These computer program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor or other programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a device for implementing the functions specified in one or more processes in the flowchart and/or one or more boxes in the block diagram.
  • These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce a manufactured product including an instruction device that implements the functions specified in one or more processes in the flowchart and/or one or more boxes in the block diagram.
  • These computer program instructions may also be loaded onto a computer or other programmable data processing device so that a series of operational steps are executed on the computer or other programmable device to produce a computer-implemented process, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more processes in the flowchart and/or one or more boxes in the block diagram.
  • a computing device includes one or more processors (CPU), input/output interfaces, network interfaces, and memory.
  • Memory may include non-permanent storage in a computer-readable medium, in the form of random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
  • Computer-readable media include permanent and non-permanent, removable and non-removable media that can store information by any method or technology.
  • Information can be computer-readable instructions, data structures, modules of programs or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disk read-only memory (CD-ROM), digital versatile disk (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission media that can be used to store information that can be accessed by a computing device.
  • computer-readable media does not include transitory media such as modulated data signals and carrier waves.
  • the embodiments of the present application may be provided as methods, systems or computer program products. Therefore, the present application may adopt the form of a complete hardware embodiment, a complete software embodiment or an embodiment in combination with software and hardware. Moreover, the present application may adopt the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) that contain computer-usable program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Computer And Data Communications (AREA)

Abstract

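The present application discloses a fault location method for a distributed network, a network device, and a storage medium, and belongs to the field of communications. The method includes: determining an indicator node relationship chain based on a preset node information file, wherein the indicator node relationship chain includes a plurality of indicator nodes and is used to reflect the association relationship between the indicator nodes; collecting indicator data of a target object, the target object including the plurality of indicator nodes; determining an abnormal indicator node from the plurality of indicator nodes based on the indicator data; determining a fault indicator node relationship chain from the indicator node relationship chain based on the abnormal indicator node; and determining a fault root cause node from the fault indicator node relationship chain.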

Description

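Fault location method for a distributed network, network device, and storage medium
Cross-reference to related applications
This application claims priority to Chinese patent application No. 202310156176.5, filed with the China Patent Office on February 14, 2023 and entitled "Fault location method for a distributed network, network device, and storage medium", the entire contents of which are incorporated herein by reference.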
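Technical field
The present application belongs to the field of communications, and specifically relates to a fault location method for a distributed network, a network device, and a storage medium.
Background
With the development of communication technology, distributed networks have been widely adopted. How to locate faults in a distributed network is a problem worthy of attention.
In the related art, fault location often relies on manual work by operation and maintenance personnel. As services continue to expand, distributed networks become increasingly complex, and this reliance on manual fault location suffers from low efficiency.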
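Summary
The embodiments of the present application provide a fault location method for a distributed network, a network device, and a storage medium, which can solve the problem of low efficiency in the related art caused by relying on manual fault location.
In a first aspect, a fault location method for a distributed network is provided, applied to a network device. The method includes: determining an indicator node relationship chain based on a preset node information file, wherein the indicator node relationship chain includes a plurality of indicator nodes and is used to reflect the association relationship between the indicator nodes; collecting indicator data of a target object, the target object including the plurality of indicator nodes; determining an abnormal indicator node from the plurality of indicator nodes based on the indicator data; determining a fault indicator node relationship chain from the indicator node relationship chain based on the abnormal indicator node; and determining a fault root cause node from the fault indicator node relationship chain.
In a second aspect, a network device is provided, including a processor and a memory, wherein the memory stores a program, and the program, when executed, implements the method according to the first aspect.
In a third aspect, an embodiment of the present application provides a readable storage medium storing a program or instructions, wherein the program or instructions, when executed, implement the steps of the method according to the first aspect.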
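Brief description of the drawings
Figure 1 is a flowchart of a fault location method for a distributed network provided in an embodiment of the present application;
Figure 2 is a flowchart of another fault location method for a distributed network provided in an embodiment of the present application;
Figure 3-1 is a schematic diagram of an example indicator node relationship chain provided in an embodiment of the present application;
Figure 3-2 is a schematic diagram of an example fault indicator node relationship chain provided in an embodiment of the present application;
Figure 4-1 is a schematic diagram of another example indicator node relationship chain provided in an embodiment of the present application;
Figure 4-2 is a schematic diagram of another example fault indicator node relationship chain provided in an embodiment of the present application;
Figure 5 is a structural block diagram of a fault location device for a distributed network provided in an embodiment of the present application;
Figure 6 is a schematic diagram of a network device provided in an embodiment of the present application.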
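Detailed description
The technical solutions in the embodiments of the present application are described clearly below with reference to the accompanying drawings. Obviously, the described embodiments are some, rather than all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application fall within the scope of protection of the present application.
The terms "first", "second", and the like in the specification and claims of the present application are used to distinguish similar objects, not to describe a specific order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the present application can be implemented in orders other than those illustrated or described here. Objects distinguished by "first", "second", and the like are usually of one type, and their number is not limited; for example, there may be one or more first objects. In addition, "and/or" in the specification and claims denotes at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the associated objects before and after it.
The fault location method for a distributed network provided in the embodiments of the present application targets the way distributed service faults are handled in the related art: it replaces manual, step-by-step troubleshooting with automatic and precise location, reducing the complexity and time cost of fault location. The whole method is executed by a network device rather than manually, so execution efficiency is greatly improved, which solves the problem of low efficiency in the related art caused by relying on manual fault location. In addition, self-healing processing can be performed after fault location, improving the efficiency of fault resolution.
For fault location, the embodiments of the present application may use a self-implemented association processing algorithm and a dynamic threshold algorithm. The association processing algorithm may, for example, perform association-based location according to the assigned indicator IDs, organizing originally independent services and indicators into an indicator node relationship chain. The dynamic threshold algorithm may change the indicator abnormality threshold dynamically: starting from the configured default threshold, it is trained continuously to obtain a relatively reasonable indicator abnormality threshold. Based on the indicator node relationship chain obtained after association processing, a fault indicator node relationship chain can then be obtained to screen out the indicator nodes that may be faulty. For fault repair, a self-implemented self-healing processing algorithm handles different types of abnormal faults in a timely manner, effectively improving fault recovery efficiency and user experience.
The fault location method for a distributed network provided in the embodiments of the present application is applied in a distributed network environment. The distributed network includes multiple computers that may use the http/https protocol, which is used for sending and requesting network data.
Further discussion follows with reference to the accompanying drawings.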
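Figure 1 is a flowchart of a fault location method for a distributed network provided in an embodiment of the present application. The fault location method can be applied to a network device. Referring to Figure 1, the method may include the following steps.
Step 110: determine an indicator node relationship chain based on a preset node information file; wherein the indicator node relationship chain includes a plurality of indicator nodes and is used to reflect the association relationship between the indicator nodes.
In the embodiments of the present application, the preset node information file may be any file that can be used to determine the indicator node relationship chain. The node information file may contain the indicator identifiers corresponding to the indicator nodes and the association relationships between the indicator identifiers. An indicator identifier may include an indicator name, an indicator ID, or the like. Determining the indicator node relationship chain in step 110 may include: determining the chain based on the indicator identifiers in the node information file and the association relationships between them, where one indicator identifier corresponds to one indicator node in the chain. In this way, the association relationships between the indicator identifiers in the preset node information file can be used directly to build the indicator node relationship chain quickly. Of course, the preset node information file may also directly contain the indicator node relationship chain, in which case the chain can be obtained directly from the file.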
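The construction in step 110 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the file layout (`indicators`, `associations`) and the function name `build_relationship_chain` are assumptions introduced here, and the chain is represented as a mapping from each indicator identifier to its upstream identifiers.

```python
# Hypothetical content of a preset node information file: indicator
# identifiers plus (downstream, upstream) association pairs.
node_info = {
    "indicators": ["x", "a", "b", "c", "g"],
    "associations": [("x", "a"), ("x", "b"), ("x", "c"), ("a", "g")],
}

def build_relationship_chain(info):
    """Map every indicator identifier to the list of its upstream identifiers."""
    # One indicator identifier corresponds to one node in the chain.
    chain = {indicator: [] for indicator in info["indicators"]}
    for downstream, upstream in info["associations"]:
        if downstream not in chain or upstream not in chain:
            raise ValueError(f"unknown indicator in association: {downstream} -> {upstream}")
        chain[downstream].append(upstream)
    return chain

chain = build_relationship_chain(node_info)
```

Storing upstream links per node keeps the later steps simple, since both fault-chain extraction and root cause selection only need to ask which upstream nodes a given node has.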
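In the embodiments of the present application, the preset node information file may be a single file or may include multiple files. Where it is a single file, it may, for example, be a node configuration file used to build the indicator node relationship chain; the node configuration file may contain the indicator identifiers corresponding to the indicator nodes and the association relationships between the indicator identifiers. Where it includes multiple files, it may, for example, include a node configuration file and a service indicator configuration file. The service indicator configuration file may include service names and the upstream and downstream association relationships between the service names, with each service name corresponding to at least one indicator identifier. The node configuration file may include the connection relationships between the indicator identifiers corresponding to two adjacent service names that have such an upstream and downstream association relationship. Accordingly, when the indicator node relationship chain is determined based on the indicator identifiers in the node information file and the association relationships between them, the chain can be formed according to the connection relationships in the node configuration file. In this way, the node configuration file and the service indicator configuration file are linked to each other through the upstream and downstream association relationships between service names, and the chain is then formed based on the connection relationships in the node configuration file. Determining the chain through multiple files lets each file focus on its own function, so that while the chain is being formed, the data in each file can be conveniently looked up and retrieved by function.
It should be noted that the upstream and downstream association relationship in the embodiments of the present application may denote a specific association between two adjacent objects. Taking an adjacent first object and second object as an example, if the upstream object of the first object is the second object, this may mean that the second object causes the abnormality of the first object. In that case, the first object is the downstream object of the second object.
In another embodiment, the node information file may, for example, include a node configuration file and a service indicator configuration file. The node configuration file includes a first service identifier and the indicator identifiers that have a mapping relationship with the first service identifier, among which there is at least one target indicator identifier. The service indicator configuration file includes a second service identifier that has an association relationship with the target indicator identifier. Both the first service identifier and the second service identifier may include a service name. Where the node configuration file contains the indicator identifiers corresponding to the indicator nodes and the association relationships between the indicator identifiers, determining the indicator node relationship chain based on the indicator identifiers in the node information file and the association relationships between them may include: obtaining the first service identifier in the node configuration file and the indicator identifiers that have a mapping relationship with the first service identifier; determining, from the service indicator configuration file, the second service identifier corresponding to the target indicator identifier among those indicator identifiers; and determining the indicator node relationship chain based on the first service identifier, the indicator identifiers mapped to the first service identifier, and the second service identifier corresponding to the target indicator identifier; wherein each first service identifier and each second service identifier corresponds to one node in the indicator node relationship chain. In this way, the node configuration file and the service indicator configuration file are linked to each other through the target indicator identifier, thereby forming the indicator node relationship chain. Determining the chain through multiple files lets each file focus on its own function, so that while the chain is being formed, the data in each file can be conveniently looked up and retrieved by function.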
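Step 120: collect indicator data of a target object; the target object includes the plurality of indicator nodes.
In the embodiments of the present application, the target object may be the plurality of indicator nodes; of course, the target object may also include indicator nodes other than the plurality of indicator nodes.
In the embodiments of the present application, the indicator data (index data) may include the indicator values of the indicator nodes. In addition to the indicator values, the indicator data may also include at least one of the following: an indicator ID and an indicator name. The indicator ID may take the form of a number, for example 1001.
In the embodiments of the present application, the indicator data of the target object may be collected based on the preset node information file. For example, the preset node information file may contain indicator collection information, which may include a description of the indicator, the name of the service where the indicator is located, the indicator ID, and the request method and request parameters for collecting the indicator. In this way, the specific indicator data under a specific service instance can be obtained precisely.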
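Step 130: determine an abnormal indicator node from the plurality of indicator nodes based on the indicator data.
In the embodiments of the present application, the indicator data may include the indicator values of the plurality of indicator nodes, and the node information file may further include an indicator abnormality threshold. Accordingly, determining the abnormal indicator node from the plurality of indicator nodes in step 130 may include: determining, from the plurality of indicator nodes, the indicator nodes whose indicator values are greater than the indicator abnormality threshold; and determining those indicator nodes as abnormal indicator nodes. In this way, abnormal indicator nodes can be determined quickly by comparing the indicator values in the collected indicator data with the indicator abnormality threshold contained in the node information file.
It should be noted that there may be one indicator abnormality threshold or multiple. Where there is one threshold, the indicator value of every indicator node can be compared with that threshold. Where there are multiple thresholds, their number may equal the number of indicator nodes. That is, the plurality of indicator nodes includes N indicator nodes, N being a positive integer greater than 1, the indicator abnormality threshold includes N thresholds, and the N indicator nodes correspond one-to-one to the N thresholds. When determining the indicator nodes whose indicator values are greater than the indicator abnormality threshold, it can be judged, for each of the plurality of indicator nodes, whether the indicator value of that node is greater than the threshold corresponding to that node. Because each threshold corresponds to one indicator node and the thresholds of different nodes can differ, the abnormal indicator nodes obtained are more accurate. It should be understood that the indicator abnormality thresholds in the embodiments of the present application may be updated periodically; that is, all or some of the N thresholds may be updated periodically.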
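The per-node comparison in step 130 can be sketched as below. The function name `find_abnormal_nodes` and the sample values are illustrative assumptions; the only rule taken from the description is that a node is abnormal when its indicator value is greater than its own threshold.

```python
def find_abnormal_nodes(indicator_values, thresholds):
    """Return the identifiers whose collected value exceeds their own threshold."""
    abnormal = set()
    for indicator_id, value in indicator_values.items():
        threshold = thresholds.get(indicator_id)
        # Nodes without a configured threshold are skipped in this sketch.
        if threshold is not None and value > threshold:
            abnormal.add(indicator_id)
    return abnormal

# Made-up sample data: one threshold per indicator node (N nodes, N thresholds).
values = {"x": 95.0, "a": 80.0, "b": 10.0, "g": 70.0}
thresholds = {"x": 90.0, "a": 75.0, "b": 50.0, "g": 60.0}
abnormal = find_abnormal_nodes(values, thresholds)
```

With a single shared threshold, the same function can be reused by passing a dictionary that maps every identifier to that one value.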
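Where the preset node information file includes a node configuration file and a service indicator configuration file, the indicator abnormality threshold may be stored in the service indicator configuration file. The indicator repair information mentioned later may also be stored in the service indicator configuration file.
Step 140: determine a fault indicator node relationship chain from the indicator node relationship chain based on the abnormal indicator node.
In the embodiments of the present application, the abnormal indicator nodes present in the indicator node relationship chain can be connected to each other to form the fault indicator node relationship chain.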
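One way to read step 140 is as filtering the relationship chain down to its abnormal nodes and the links between them. The sketch below assumes the chain is stored as a mapping from each indicator to its upstream indicators; the sample data and the name `build_fault_chain` are hypothetical.

```python
# Indicator node relationship chain: indicator -> list of upstream indicators.
# The data is purely illustrative.
chain = {
    "x": ["a", "b", "c"],
    "a": ["d", "f", "g"],
    "c": ["j", "k", "m"],
    "b": [], "d": [], "f": [], "g": [], "j": [], "k": [], "m": [],
}
abnormal = {"x", "a", "g"}

def build_fault_chain(chain, abnormal):
    """Keep only abnormal nodes and the links that connect two abnormal nodes."""
    return {
        node: [up for up in upstreams if up in abnormal]
        for node, upstreams in chain.items()
        if node in abnormal
    }

fault_chain = build_fault_chain(chain, abnormal)
```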
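Step 150: determine a fault root cause node from the fault indicator node relationship chain.
In the embodiments of the present application, determining the fault root cause node in step 150 may include: taking a target indicator node in the fault indicator node relationship chain that has no upstream indicator node as the fault root cause node; wherein the upstream indicator node includes a node in the fault indicator node relationship chain that causes the abnormality of the target indicator node. In this way, the fault root cause node can be determined quickly from the fault indicator node relationship chain by judging whether an upstream indicator node exists.
The embodiments of the present application do not limit the process of determining the target indicator node. In one possible implementation, the check for an upstream indicator node is performed only on the indicator nodes located at the ends of the fault indicator node relationship chain, and a target indicator node without an upstream indicator node is taken as the fault root cause node. Specifically, a first designated node located at an end of the fault indicator node relationship chain can be obtained; a target indicator node is determined from the first designated nodes, where the fault indicator node relationship chain contains no upstream node that has a mapping relationship with the target indicator node, the upstream node including a node that causes the abnormality of the target indicator node; and the target indicator node is determined as the fault root cause node. Of course, in the embodiments of the present application, the check for an upstream indicator node may also be performed one by one on every indicator node in the fault indicator node relationship chain, and a target indicator node without an upstream indicator node is taken as the fault root cause node.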
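The node-by-node variant of step 150 can be sketched as follows, again assuming the fault chain is a mapping from each abnormal indicator to its abnormal upstream indicators. The function name and sample data are illustrative only.

```python
def find_root_cause_nodes(fault_chain):
    """Return the nodes of the fault chain that have no upstream node in it."""
    return [node for node, upstreams in fault_chain.items() if not upstreams]

# Hypothetical fault chain: x is caused by a, and a is caused by g.
fault_chain = {"x": ["a"], "a": ["g"], "g": []}
roots = find_root_cause_nodes(fault_chain)
```

The end-node variant described above would apply the same test, but only to the nodes at the ends of the chain, which avoids visiting interior nodes.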
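In the embodiments of the present application, an indicator node relationship chain is introduced, a fault indicator node relationship chain is determined based on it, and the fault root cause node is then determined. On the one hand, because the indicator node relationship chain reflects the association relationships between the nodes, the fault root cause node can be determined quickly from the fault indicator node relationship chain derived from it. On the other hand, the whole method is executed by a network device rather than manually, so execution efficiency is greatly improved, which solves the problem of low efficiency in the related art caused by relying on manual fault location.
The indicator node relationship chain and the fault indicator node relationship chain are discussed further below with reference to specific schematic diagrams. It should be understood that the indicator node relationship chain in the embodiments of the present application may include nodes corresponding to services, or may omit them and include only nodes corresponding to indicator identifiers.
Where the indicator node relationship chain includes only nodes corresponding to indicator identifiers, an example of the chain is shown in Figure 3-1. In the chain shown in Figure 3-1, indicators a-d, indicators f-g, indicators j-k, indicator m, and indicator x can each correspond to an indicator identifier. The upstream indicator nodes of indicator x are indicator a, indicator b, and indicator c. The upstream indicator nodes of indicator a are indicator d, indicator f, and indicator g. The upstream indicator nodes of indicator c are indicator j, indicator k, and indicator m. As described above, after the abnormal indicator nodes indicator x, indicator a, and indicator g are determined, the fault indicator node relationship chain shown in bold in Figure 3-2 can be obtained; that is, indicator x, indicator a, and indicator g form the fault indicator node relationship chain.
Where the indicator node relationship chain includes both nodes corresponding to indicator identifiers and nodes corresponding to services, an example of the chain is shown in Figure 4-1. In the chain shown in Figure 4-1, indicators a-d, indicators f-g, indicators j-k, and indicator m can each correspond to an indicator identifier. The indicators collected under service A are indicator a, indicator b, and indicator c. The indicators collected under service B are indicator d, indicator f, and indicator g. The indicators collected under service C are indicator j, indicator k, and indicator m. The upstream node of indicator a is service B, and the upstream node of indicator c is service C. As described above, after the abnormal indicator nodes are determined, the fault indicator node relationship chain shown in bold in Figure 4-2 can be obtained; that is, service A, indicator a, service B, and indicator g form the fault indicator node relationship chain.
Comparing the fault indicator node relationship chains in Figure 3-2 and Figure 4-2 shows that Figure 4-2 contains both nodes corresponding to services and nodes corresponding to indicator identifiers, whereas Figure 3-2 contains only nodes corresponding to indicator identifiers and no nodes corresponding to services. Both examples fall within the scope of the embodiments of the present application.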
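The fault location method provided in the embodiments of the present application can not only determine the fault root cause node but also repair the fault. Where the fault is to be repaired, the node information file may include the indicator repair information of the fault root cause node. In one example, after the fault root cause node is determined from the fault indicator node relationship chain, the method may further include: repairing the fault root cause node using the indicator repair information of the fault root cause node. By using the indicator repair information in the node information file in this way, self-healing of the fault can be achieved, greatly reducing or even eliminating manual fault repair operations and improving work efficiency.
It should be understood that, after the fault root cause node has been repaired using its indicator repair information, other abnormal indicator nodes may still exist in some cases; if they are not repaired, faults will remain in the distributed network. Therefore, in one embodiment, after the fault root cause node is repaired using its indicator repair information, the method may further include: when none of the target conditions is met, re-determining the fault root cause node and repairing the re-determined fault root cause node, until at least one of the target conditions is met; wherein the target conditions include: the indicator repair information of the fault root cause node is empty, the repair of the fault root cause node fails, or, after the fault root cause node is repaired, all nodes in the fault indicator node relationship chain return to normal. The process of re-determining the fault root cause node is as described above. It should be understood that, where the repair information of the determined fault root cause node is empty or the repair of the re-determined fault root cause node fails, an alarm prompt message may be issued to indicate that the node whose repair failed should be repaired manually. The alarm prompt message may take the form of text, an image, or sound.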
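The repair loop and its three target conditions can be sketched as follows. The callbacks `try_repair` and `is_abnormal`, the status strings, and the `max_rounds` bound are assumptions introduced for this illustration; the description itself only specifies the three stopping conditions. The sketch also assumes the fault chain has no cycles.

```python
def self_heal(fault_chain, repair_info, try_repair, is_abnormal, max_rounds=10):
    """Repair one root cause per round until a target condition is met.

    Returns (status, node). max_rounds is an added safety bound and is not
    part of the described method.
    """
    for _ in range(max_rounds):
        # Re-check the chain: keep only the nodes that are still abnormal.
        active = {
            node: [up for up in upstreams if is_abnormal(up)]
            for node, upstreams in fault_chain.items()
            if is_abnormal(node)
        }
        if not active:
            return ("recovered", None)              # all nodes back to normal
        # Re-determine the root cause: an abnormal node with no abnormal upstream.
        root = next((n for n, ups in active.items() if not ups), None)
        if root is None:
            return ("alarm_no_root_cause", None)    # only possible with a cycle
        steps = repair_info.get(root)
        if not steps:
            return ("alarm_no_repair_info", root)   # repair information is empty
        if not try_repair(root, steps):
            return ("alarm_repair_failed", root)    # repair failed
    return ("alarm_max_rounds", None)

# Simulated environment: a successful repair makes the node normal again.
abnormal_now = {"x", "a", "g"}

def try_repair(node, steps):
    abnormal_now.discard(node)
    return True

result = self_heal(
    {"x": ["a"], "a": ["g"], "g": []},
    {"x": ["restart"], "a": ["clean data"], "g": ["check data", "restart"]},
    try_repair,
    lambda node: node in abnormal_now,
)
```

In this simulated run the nodes are repaired in the order g, a, x, and the loop ends with the status `recovered`; any status starting with `alarm_` corresponds to handing the case to operation and maintenance personnel.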
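Of course, in the embodiments of the present application, after the fault root cause node has been repaired using its indicator repair information, the distributed network may in some cases no longer contain any fault, in which case there is no need to re-determine the fault root cause node.
The embodiments of the present application may repair only the fault root cause node, or may also repair fault indicator nodes in the fault indicator node relationship chain other than the fault root cause node. The following discusses how one or more fault indicator nodes in the fault indicator node relationship chain are repaired. In this case, the node information file may include the indicator repair information of each indicator node. After the fault root cause node is determined from the fault indicator node relationship chain, the method may further include the following steps.
Starting from a first target node in the fault indicator node relationship chain and moving toward a second target node, the indicator repair information of each node is obtained in turn; the first target node is the fault root cause node in the fault indicator node relationship chain, and the first target node and the second target node are located at the two ends of the fault indicator node relationship chain. The nodes are then repaired in turn, starting from the first target node, using the indicator repair information of each node.
Repairing multiple indicator nodes in the fault indicator node relationship chain in this way reduces the work of re-determining the fault root cause node and improves the success rate of repair.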
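In the embodiments of the present application, repairing the nodes in turn starting from the first target node using the indicator repair information of each node may include: repairing a third target node using the indicator repair information of the third target node; where the third target node is repaired successfully, determining a fourth target node that has an association relationship with the third target node; and repairing the fourth target node using the indicator repair information of the fourth target node; wherein the third target node and the fourth target node are two adjacent nodes in the fault indicator node relationship chain. In this way, for any two adjacent nodes in the fault indicator node relationship chain, the fault can be repaired by repairing the upstream node first, which further improves the success rate of repair.
It should be understood that the third target node may be the fault root cause node in the fault indicator node relationship chain or a node other than the fault root cause node. In one embodiment, the abnormality of the fourth target node may include an abnormality caused by the third target node. In that case, after the third target node is repaired successfully, the abnormality of the fourth target node may or may not disappear; for example, the abnormality of the fourth target node may also include abnormalities caused by other factors, such as external environmental factors. In other words, a successful repair of the third target node does not mean that the abnormality of the fourth target node will necessarily disappear. Where the fourth target node is still abnormal, it can be repaired using its indicator repair information. Of course, in the embodiments of the present application, after the third target node is repaired successfully, the fourth target node can still be repaired using its indicator repair information even if it is no longer abnormal.
It can therefore be understood that, in the repair process above, once the fourth target node is determined, it can be repaired directly using its indicator repair information regardless of whether it is faulty. This ensures that repair is performed in a uniform manner and improves repair efficiency.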
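The sequential repair from the first target node toward the second target node can be sketched as follows. The function name, the `try_repair` callback, and the decision to stop when a node has no repair information are assumptions made for this illustration; the description only states that the next adjacent node is repaired after the previous one succeeds.

```python
def repair_along_chain(ordered_nodes, repair_info, try_repair):
    """Repair nodes in order, from the root cause node toward the far end.

    ordered_nodes lists the fault chain from the first target node (the root
    cause) to the second target node. Returns (repaired_nodes, stopped_at);
    stopped_at is None when every node was repaired.
    """
    repaired = []
    for node in ordered_nodes:
        steps = repair_info.get(node)
        # Each node is repaired uniformly, without re-checking whether it is
        # still abnormal after its upstream node was repaired.
        if not steps or not try_repair(node, steps):
            return repaired, node
        repaired.append(node)
    return repaired, None

calls = []

def try_repair(node, steps):
    calls.append(node)
    return node != "a"  # pretend that repairing "a" fails

repaired, stopped_at = repair_along_chain(
    ["g", "a", "x"],
    {"g": ["restart"], "a": ["clean data"], "x": ["restart"]},
    try_repair,
)
```

Here the root cause g is repaired, the repair of a fails, and x is never attempted, which reflects the upstream-first ordering.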
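Figure 2 is a flowchart of a fault location method for a distributed network provided in an embodiment of the present application. The fault location method can be applied to a network device. Referring to Figure 2, the method may include the following steps.
Step 210: determine an indicator node relationship chain based on a preset node information file; wherein the indicator node relationship chain includes a plurality of indicator nodes and is used to reflect the association relationship between the indicator nodes.
Step 220: collect indicator data of a target object; the target object includes the plurality of indicator nodes.
Step 230: determine an abnormal indicator node from the plurality of indicator nodes based on the indicator data.
Step 240: determine a fault indicator node relationship chain from the indicator node relationship chain based on the abnormal indicator node.
Step 250: determine a fault root cause node from the fault indicator node relationship chain.
Step 260: repair the fault root cause node using the indicator repair information of the fault root cause node.
Step 270: when none of the target conditions is met, re-determine the fault root cause node and repair the re-determined fault root cause node, until at least one of the target conditions is met.
The target conditions include: the indicator repair information of the fault root cause node is empty, the repair of the fault root cause node fails, or, after the fault root cause node is repaired, all nodes in the fault indicator node relationship chain return to normal.
For the content of steps 210 to 250, refer to the description above.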
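It should be understood that, in another embodiment of the present application, steps 260 and 270 may be skipped after step 250 and the following process performed instead: starting from a first target node in the fault indicator node relationship chain and moving toward a second target node, the indicator repair information of each node is obtained in turn; the first target node is the fault root cause node in the fault indicator node relationship chain, and the first target node and the second target node are located at the two ends of the fault indicator node relationship chain; the nodes are then repaired in turn, starting from the first target node, using the indicator repair information of each node. Repairing the nodes in turn in this way includes: repairing a third target node using the indicator repair information of the third target node; where the third target node is repaired successfully, determining a fourth target node that has an association relationship with the third target node; and repairing the fourth target node using the indicator repair information of the fourth target node; wherein the third target node and the fourth target node are two adjacent nodes in the fault indicator node relationship chain.
As can be seen from the above, after the fault root cause node is determined, the embodiments of the present application may repair only the determined fault root cause node each time, or may repair multiple nodes including the fault root cause node each time.
In the embodiments of the present application, an indicator node relationship chain is introduced, a fault indicator node relationship chain is determined based on it, and the fault root cause node is then determined. On the one hand, because the indicator node relationship chain reflects the association relationships between the nodes, the fault root cause node can be determined quickly from the fault indicator node relationship chain derived from it. On the other hand, the whole method is executed by a network device rather than manually, so execution efficiency is greatly improved, which solves the problem of low efficiency in the related art caused by relying on manual fault location. In addition, the fault root cause node is repaired using its indicator repair information, and after the repair the fault root cause node is re-determined and repaired until a target condition is met. In this way, faults can be located quickly and self-healing can be achieved, greatly reducing or even eliminating manual fault repair operations and improving work efficiency.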
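The fault location method for a distributed network provided in the embodiments of the present application is discussed in further detail below, taking as an example a preset node information file that includes a node configuration file and a service indicator configuration file. It should be understood that the content described below is only an example and not a limitation.
The process of the fault location method for a distributed network provided in the embodiments of the present application may be as follows.
Two important configuration files are configured: the service indicator configuration file and the node configuration file of the distributed platform, and the accuracy of the configured information is confirmed. The service indicator configuration file may contain the types of abnormality of the current service; each type may contain multiple pieces of indicator information, and each piece of indicator information may include an indicator ID (unique, not repeated), an indicator abnormality threshold, the name of the upstream service that causes the indicator abnormality, and indicator repair information. One example of the specific content may be as follows.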

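As can be seen from the above, the current service has exactly one type of abnormality, Runtime Exception (a service may of course have multiple abnormality types, all in the same content format). Two indicators are related to this abnormality, 1000 and 1002, and the repair suggestion differs for each indicator. For indicator 1000, the repair suggestion has three steps (each with corresponding processing logic): check data, clean data, and restart; its dependent service is hbase. If indicator 1000 has a problem, it is first checked whether the indicator has a dependent service. If it does, the repair steps of this indicator are not executed for the time being; instead, the fault indicators of the dependent service are searched, and the search continues until the final fault root cause indicator is found. The repair steps of that root cause indicator are executed, and the related upstream and downstream service indicators are then checked to see whether they have all recovered. For example, if indicator 1000 fails, the related indicators of its dependent service hbase are checked and repaired; once the hbase fault indicator has been repaired, the repair steps of indicator 1000 itself are executed. For indicator 1002, the repair suggestion is null (empty information, indicating that there are currently no recommended steps for automatic repair), so an alarm is sent to the platform and handed over to the operation and maintenance personnel for further confirmation, analysis, and handling. If the associated service information is empty, the indicator is already the final root cause indicator, and its own repair steps are checked; if the repair steps are null, an alarm is sent to the platform for further analysis.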
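The following sketch shows one way such a service indicator configuration file could be represented and consulted. Only the abnormality type, the indicator IDs 1000 and 1002, the dependent service hbase, and the three repair steps come from the description; the field names, the threshold values, the empty upstream service for indicator 1002, and the function `next_action` are assumptions made for illustration.

```python
# Hypothetical in-memory form of a service indicator configuration file.
service_indicator_config = {
    "RuntimeException": [
        {
            "indicatorId": 1000,
            "threshold": 80,                      # placeholder value
            "upstreamService": "hbase",
            "repairSteps": ["check data", "clean data", "restart"],
        },
        {
            "indicatorId": 1002,
            "threshold": 80,                      # placeholder value
            "upstreamService": None,              # assumed empty
            "repairSteps": None,                  # null: no automatic repair
        },
    ]
}

def next_action(config, indicator_id):
    """Decide what to do for a faulty indicator, following the described rules."""
    for entries in config.values():
        for entry in entries:
            if entry["indicatorId"] != indicator_id:
                continue
            if entry["upstreamService"]:
                # A dependent service exists: inspect its fault indicators first.
                return ("check_upstream", entry["upstreamService"])
            if not entry["repairSteps"]:
                # Root cause indicator without repair steps: raise an alarm.
                return ("send_alarm", None)
            return ("run_repair", entry["repairSteps"])
    raise KeyError(indicator_id)

action_1000 = next_action(service_indicator_config, 1000)
action_1002 = next_action(service_indicator_config, 1002)
```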
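In the embodiments of the present application, the node configuration file of the distributed platform may contain the access information of the distributed platform and an indicator list (all indicators in the service indicator configuration file can be summarized in this list). Each element in the list represents the collection information of one specific indicator, including a description of the indicator, the name of the service where the indicator is located, the indicator ID (which can be the same as the indicator ID contained in the service indicator configuration file, so that it can be used in the association processing logic), and the request method and request parameters for collecting the indicator (the request parameters include the cluster name, host name, and service instance name, so as to pinpoint the specific indicator under a specific service instance). One example of the specific content may be as follows.
From the above, the login information of the platform and a summary of the service indicators can be obtained. It should be understood that the example above shows only one indicator. The indicator in this example is 1001 and comes from the service service-A; the type, url, and requestBody content represent the collection information required to access the indicator. Based on this information, the specific indicator value can be collected and compared with the indicator abnormality threshold in the service indicator configuration file to determine whether the indicator is abnormal.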
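A node configuration entry of this kind could be turned into a collection request as sketched below. The indicator ID 1001, the service name service-A, and the keys type, url, and requestBody come from the description; every other key, the host name, the path, and the parameter values are hypothetical placeholders. The request is only constructed here, not sent.

```python
import json
import urllib.request

# Hypothetical in-memory form of a node configuration file.
node_config = {
    "access": {"baseUrl": "https://platform.example.invalid"},
    "indicators": [
        {
            "description": "example indicator",
            "serviceName": "service-A",
            "indicatorId": 1001,
            "type": "POST",
            "url": "/api/metrics/query",
            "requestBody": {
                "clusterName": "cluster-1",
                "hostName": "host-1",
                "serviceInstanceName": "service-A-0",
            },
        }
    ],
}

def build_collection_request(config, indicator_id):
    """Build (but do not send) the HTTP request that collects one indicator."""
    entry = next(e for e in config["indicators"] if e["indicatorId"] == indicator_id)
    body = json.dumps(entry["requestBody"]).encode("utf-8")
    return urllib.request.Request(
        url=config["access"]["baseUrl"] + entry["url"],
        data=body,
        method=entry["type"],
        headers={"Content-Type": "application/json"},
    )

request = build_collection_request(node_config, 1001)
```

Passing the resulting object to `urllib.request.urlopen` would perform the actual collection against a real platform endpoint.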
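After the configuration files are configured, they can be deployed on network devices such as platform control nodes.
The configuration files are loaded, and after loading, indicators can be collected according to the indicator information in the configuration files (collection can be scheduled, for example with a default interval of 1 hour). The format of the collected indicator data may be as follows.
In the example above, three indicators are collected under the Server-A service, index-a, index-b, and index-c, together with the corresponding indicator IDs and collected indicator values.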
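After the indicator data is collected, the association processing algorithm logic can be called first: the upstream service name that causes the indicator abnormality, as configured in the service indicator configuration file, is used to obtain the previous node in the relationship chain. For example, if the upstream service of the index-a indicator of the Server-A service is Server-B and the upstream service of the index-c indicator is Server-C, the indicator node relationship chain shown in Figure 4-1 can be obtained. It should be understood that the service nodes shown in Figure 4-1, namely the service A node, the service B node, and the service C node, are only an example. In the embodiments of the present application, service nodes may be absent, with the indicator node relationship chain presented entirely in the form of indicator nodes.
The value of each indicator on the resulting indicator node relationship chain is compared with its own abnormality threshold, and indicators that exceed their thresholds can be considered faulty, finally yielding the fault indicator node relationship chain shown in bold in Figure 4-2. Before the comparison, the indicator abnormality threshold can first be initialized through configuration; the next time the scheduled task runs, the threshold can be trained starting from the default value and updated to a new abnormality threshold. In other words, the indicator abnormality threshold can be a dynamically adjusted threshold.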
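The description does not specify how the dynamic threshold is learned. As one possible stand-in, the sketch below blends the current threshold with a statistic of recently collected values (mean plus k population standard deviations) on each scheduled run. The update rule, the function name, and the parameters `alpha` and `k` are all assumptions and should not be read as the applicant's algorithm.

```python
import statistics

def update_threshold(current, recent_values, alpha=0.2, k=3.0):
    """Move the threshold toward mean + k * stdev of recently collected values.

    alpha controls how fast the configured default is replaced by learned
    values; both alpha and k are illustrative tuning parameters.
    """
    if len(recent_values) < 2:
        return current  # not enough data to learn from yet
    observed = statistics.mean(recent_values) + k * statistics.pstdev(recent_values)
    return (1 - alpha) * current + alpha * observed

threshold = 100.0  # initial value taken from the configuration file
threshold = update_threshold(threshold, [50.0, 50.0, 50.0], alpha=0.5)
```

Calling `update_threshold` once per scheduled collection lets the threshold drift from its configured default toward the range the indicator actually occupies.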
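The final fault root cause node is found from the fault indicator node relationship chain, for example indicator g of service B in Figure 4-2. The self-healing processing algorithm can then be used to look up the indicator repair information of the fault root cause node (for example, indicator g) in the service indicator configuration file. If the indicator repair information corresponding to the fault root cause node is empty, the indicator cannot be repaired automatically, and an alarm prompt message is generated and sent to the platform, to be handed over to the operation and maintenance personnel for further confirmation, analysis, and handling. If the indicator repair information is not empty, the corresponding repair operation is matched in the self-healing processing logic. If the repair operation fails, an alarm prompt message is likewise generated and sent to the platform, to be handled by the operation and maintenance personnel; if the repair succeeds, self-healing has succeeded. It should be understood that the service nodes shown in Figure 4-2, namely the service A node, the service B node, and the service C node, are only an example. In the embodiments of the present application, service nodes may be absent, with the indicator node relationship chain presented entirely in the form of indicator nodes.
After the repair of the fault root cause node has been executed, the fault indicator node relationship chain can be checked on a schedule. If all nodes on the fault indicator node relationship chain have recovered, no processing is required. If one of the nodes is still faulty, then, when none of the target conditions is met, the fault root cause node can be re-determined and the re-determined fault root cause node repaired, until at least one of the target conditions is met; wherein the target conditions include: the indicator repair information of the fault root cause node is empty, the repair of the fault root cause node fails, or, after the fault root cause node is repaired, all nodes in the fault indicator node relationship chain return to normal. Where the repair information of the re-determined fault root cause node is empty or its repair fails, an alarm prompt message can be generated; after the operation and maintenance personnel have handled the alarm prompt message, the whole fault location and self-healing process can end.
It should be noted that if one or more distributed services are added to the distributed network during deployment, only the service indicator configuration file and the node configuration file need to be extended with the corresponding settings, and the changes take effect once the configuration files are loaded.
In view of the above, the fault location method for a distributed network provided in the embodiments of the present application has the following beneficial effects: the fault information in the distributed network is associated, so that the root cause of a fault can be identified automatically through the fault association relationships; after the fault location is confirmed, self-healing repair operations are performed according to the loaded configuration files, minimizing manual intervention; and horizontal expansion is supported, that is, after a new service is added, only the corresponding configuration files need to be configured and then reloaded after confirmation by the operation and maintenance personnel to take effect.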
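Figure 5 is a structural block diagram of a fault location device for a distributed network provided in an embodiment of the present application. Referring to Figure 5, the fault location device may include a determination module 510 and a collection module 520.
The determination module 510 is used to determine an indicator node relationship chain based on a preset node information file; wherein the indicator node relationship chain includes a plurality of indicator nodes and is used to reflect the association relationship between the indicator nodes.
The collection module 520 is used to collect indicator data of a target object; the target object includes the plurality of indicator nodes.
The determination module 510 is further used to determine an abnormal indicator node from the plurality of indicator nodes based on the indicator data; determine a fault indicator node relationship chain from the indicator node relationship chain based on the abnormal indicator node; and determine a fault root cause node from the fault indicator node relationship chain.
In the fault location device provided in the embodiments of the present application, an indicator node relationship chain is introduced, a fault indicator node relationship chain is determined based on it, and the fault root cause node is then determined. On the one hand, because the indicator node relationship chain reflects the association relationships between the nodes, the fault root cause node can be determined quickly from the fault indicator node relationship chain derived from it. On the other hand, the whole process is executed by the fault location device rather than manually, so execution efficiency is greatly improved, which solves the problem of low efficiency in the related art caused by relying on manual fault location.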
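In one embodiment of the present application, the node information file contains the indicator identifiers corresponding to the indicator nodes and the association relationships between the indicator identifiers. In determining the indicator node relationship chain based on the preset node information file, the determination module 510 is specifically used to: determine the indicator node relationship chain based on the indicator identifiers in the node information file and the association relationships between them; wherein one indicator identifier corresponds to one indicator node in the indicator node relationship chain.
In one embodiment of the present application, the node information file includes a node configuration file and a service indicator configuration file; the service indicator configuration file includes service names and the upstream and downstream association relationships between the service names, with each service name corresponding to at least one indicator identifier; the node configuration file includes the connection relationships between the indicator identifiers corresponding to two adjacent service names that have the upstream and downstream association relationship. In determining the indicator node relationship chain based on the indicator identifiers in the node information file and the association relationships between them, the determination module 510 is specifically used to: form the indicator node relationship chain according to the connection relationships in the node configuration file.
In one embodiment of the present application, the indicator data includes the indicator values of the plurality of indicator nodes, and the node information file further includes an indicator abnormality threshold corresponding to each indicator node; wherein the plurality of indicator nodes includes N indicator nodes, N being a positive integer greater than 1, the indicator abnormality threshold includes N thresholds, and the N indicator nodes correspond one-to-one to the N thresholds. In determining the abnormal indicator node from the plurality of indicator nodes, the determination module 510 is specifically used to: for each of the plurality of indicator nodes, judge whether the indicator value of that node is greater than the indicator abnormality threshold corresponding to that node; and determine each indicator node whose indicator value is greater than its corresponding indicator abnormality threshold as an abnormal indicator node.
In one embodiment of the present application, in determining the fault root cause node from the fault indicator node relationship chain, the determination module 510 is specifically used to: take a target indicator node in the fault indicator node relationship chain that has no upstream indicator node as the fault root cause node; wherein the upstream indicator node includes a node in the fault indicator node relationship chain that causes the abnormality of the target indicator node.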
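In one embodiment of the present application, the node information file further includes the indicator repair information of the fault root cause node. The fault location apparatus for a distributed network provided in the embodiments of the present application further includes a repair module, configured to repair the fault root cause node using its indicator repair information after the determination module 510 determines the fault root cause node from the fault indicator node relationship chain.
In one embodiment of the present application, after the repair module repairs the fault root cause node using its indicator repair information, the determination module 510 is further configured to: when none of the target conditions is satisfied, re-determine the fault root cause node and repair the re-determined fault root cause node, until at least one of the target conditions is satisfied. The target conditions include: the indicator repair information of the fault root cause node is empty; the repair of the fault root cause node fails; or, after the fault root cause node is repaired, all nodes in the fault indicator node relationship chain return to normal.
In one embodiment of the present application, the node information file further includes the indicator repair information of each indicator node. After the determination module determines the fault root cause node from the fault indicator node relationship chain, the fault location apparatus for a distributed network provided in the embodiments of the present application further includes: an acquisition module, configured to acquire the indicator repair information of each node in turn, starting from a first target node in the fault indicator node relationship chain and proceeding toward a second target node, where the first target node is the fault root cause node in the fault indicator node relationship chain, and the first target node and the second target node are located at the two ends of the fault indicator node relationship chain; and a repair module, configured to repair the nodes in turn, starting from the first target node, using the indicator repair information of each node.
In one embodiment of the present application, in repairing the nodes in turn starting from the first target node using the indicator repair information of each node, the repair module is specifically configured to: repair a third target node using the indicator repair information of the third target node; when the third target node is repaired successfully, determine a fourth target node that has an association relationship with the third target node; and repair the fourth target node using the indicator repair information of the fourth target node. The third target node and the fourth target node are two adjacent nodes in the fault indicator node relationship chain.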
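The node-by-node repair from the first target node toward the second target node can be sketched as below. The sketch assumes the fault indicator node relationship chain is a list ordered from the fault root cause node to the opposite end, and that the walk stops at the first node that has no repair information or whose repair fails; the stopping behavior and the names `repair_along_chain` and `repair` are assumptions, since the text only states that the next adjacent node is handled after a successful repair.

```python
from typing import Callable, Dict, List


def repair_along_chain(chain: List[str],
                       repair_info: Dict[str, str],
                       repair: Callable[[str, str], bool]) -> List[str]:
    """Repair nodes in chain order and return the nodes repaired successfully."""
    repaired: List[str] = []
    for node in chain:
        info = repair_info.get(node, "")
        # Move on to the adjacent node only after the current repair succeeds.
        if not info or not repair(node, info):
            break
        repaired.append(node)
    return repaired
```

The returned list shows how far along the chain the repair progressed, which a caller could use to decide whether an alarm is needed for the remaining nodes.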
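It should be noted that the fault location apparatus for a distributed network provided in the embodiments of the present application corresponds to the fault location method for a distributed network described above. For related content, refer to the description of the method above; details are not repeated here.
In addition, as shown in Figure 6, an embodiment of the present application further provides a network device 600, which may be a computer of any type. The network device 600 includes a processor 610 and a memory 620. The memory 620 stores a program or instructions which, when executed by the processor 610, implement the steps of any of the methods described above.
An embodiment of the present application further provides a readable storage medium storing a program or instructions which, when executed by a processor, implement the steps of any of the methods described above.
In the network device or readable storage medium provided in the embodiments of the present application, the program or instructions may be layered as follows. The first layer is the application service layer, which mainly exposes functions for external use. The second layer is the business service layer, which mainly performs internal logic calls; in the embodiments of the present application it is divided into three parts: an association processing algorithm logic module, a dynamic threshold algorithm logic module, and a self-healing processing logic module. The third layer is the platform service layer, which mainly consists of logic calls provided by the platform; in the embodiments of the present application, the platform provides a query interface for querying the information of each service indicator and assigns a unique service ID to each service indicator. The fourth layer is the basic service layer, which mainly provides the underlying basic services required by the distributed system involved in the embodiments of the present application, for example the distributed frameworks Spring Boot and Spring Cloud as well as other distributed frameworks and communication frameworks.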
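Those skilled in the art should understand that the embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, so that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.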
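In a typical configuration, a computing device includes one or more processors (CPUs), an input/output interface, a network interface, and memory.
The memory may include non-permanent memory in computer-readable media, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "include", "comprise", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, commodity, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, commodity, or device that includes the element.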
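Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.
The above are merely embodiments of the present application and are not intended to limit the present application. For those skilled in the art, the present application may have various modifications and variations. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present application shall fall within the scope of the claims of the present application.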

Claims (11)

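  1. A fault location method for a distributed network, applied to a network device, the method comprising:
    determining an indicator node relationship chain based on a preset node information file, wherein the indicator node relationship chain comprises multiple indicator nodes and is used to reflect the association relationships among the indicator nodes;
    collecting indicator data of a target object, wherein the target object comprises the multiple indicator nodes;
    determining abnormal indicator nodes from the multiple indicator nodes based on the indicator data;
    determining a fault indicator node relationship chain from the indicator node relationship chain based on the abnormal indicator nodes; and
    determining a fault root cause node from the fault indicator node relationship chain.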
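  2. The method according to claim 1, wherein the node information file contains indicator identifiers corresponding to the indicator nodes and association relationships between the indicator identifiers;
    the determining an indicator node relationship chain based on a preset node information file comprises:
    determining the indicator node relationship chain based on the indicator identifiers in the node information file and the association relationships between the indicator identifiers;
    wherein one indicator identifier corresponds to one indicator node in the indicator node relationship chain.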
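  3. The method according to claim 2, wherein the node information file comprises a node configuration file and a service indicator configuration file; the service indicator configuration file comprises service names and upstream-downstream association relationships between the service names, and each service name corresponds to at least one indicator identifier; the node configuration file comprises connection relationships between the indicator identifiers corresponding to two adjacent service names that have the upstream-downstream association relationship;
    the determining the indicator node relationship chain based on the indicator identifiers in the node information file and the association relationships between the indicator identifiers comprises:
    forming the indicator node relationship chain according to the connection relationships in the node configuration file.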
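  4. The method according to any one of claims 1 to 3, wherein the indicator data comprises indicator values of the multiple indicator nodes, and the node information file further comprises an indicator abnormality threshold corresponding to each indicator node; wherein the multiple indicator nodes comprise N indicator nodes, N being a positive integer greater than 1, the indicator abnormality thresholds comprise N thresholds, and the N indicator nodes correspond one-to-one to the N thresholds;
    the determining abnormal indicator nodes from the multiple indicator nodes comprises:
    for each of the multiple indicator nodes, judging whether the indicator value of the indicator node is greater than the indicator abnormality threshold corresponding to the indicator node; and determining each indicator node whose indicator value is greater than its corresponding indicator abnormality threshold as an abnormal indicator node.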
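  5. The method according to claim 1, wherein the determining a fault root cause node from the fault indicator node relationship chain comprises:
    taking a target indicator node that has no upstream indicator node in the fault indicator node relationship chain as the fault root cause node;
    wherein the upstream indicator node comprises a node in the fault indicator node relationship chain that causes the target indicator node to become abnormal.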
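  6. The method according to claim 1, wherein the node information file further comprises indicator repair information of the fault root cause node; after the determining a fault root cause node from the fault indicator node relationship chain, the method further comprises:
    repairing the fault root cause node using the indicator repair information of the fault root cause node.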
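  7. The method according to claim 6, wherein after the repairing the fault root cause node using the indicator repair information of the fault root cause node, the method further comprises:
    when none of the target conditions is satisfied, re-determining the fault root cause node and repairing the re-determined fault root cause node, until at least one of the target conditions is satisfied;
    wherein the target conditions comprise: the indicator repair information of the fault root cause node is empty; the repair of the fault root cause node fails; or, after the fault root cause node is repaired, all nodes in the fault indicator node relationship chain return to normal.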
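  8. The method according to claim 1, wherein the node information file further comprises indicator repair information of each indicator node;
    after the determining a fault root cause node from the fault indicator node relationship chain, the method further comprises:
    acquiring the indicator repair information of each node in turn, starting from a first target node in the fault indicator node relationship chain and proceeding toward a second target node; wherein the first target node is the fault root cause node in the fault indicator node relationship chain, and the first target node and the second target node are located at the two ends of the fault indicator node relationship chain; and
    repairing the nodes in turn, starting from the first target node, using the indicator repair information of each node.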
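  9. The method according to claim 8, wherein the repairing the nodes in turn, starting from the first target node, using the indicator repair information of each node comprises:
    repairing a third target node using the indicator repair information of the third target node;
    when the third target node is repaired successfully, determining a fourth target node that has an association relationship with the third target node; and
    repairing the fourth target node using the indicator repair information of the fourth target node;
    wherein the third target node and the fourth target node are two adjacent nodes in the fault indicator node relationship chain.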
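  10. A network device, comprising a processor and a memory, wherein the memory stores a program or instructions which, when executed by the processor, implement the steps of the method according to any one of claims 1 to 9.
  11. A readable storage medium, wherein the storage medium stores a program or instructions which, when executed, implement the steps of the method according to any one of claims 1 to 9.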
PCT/CN2024/071154 2023-02-14 2024-01-08 Fault location method for distributed network, network device, and storage medium WO2024169467A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202310156176.5 2023-02-14
CN202310156176.5A CN118509308A (zh) 2023-02-14 2023-02-14 Fault location method for distributed network, network device, and storage medium

Publications (1)

Publication Number Publication Date
WO2024169467A1 (zh) 2024-08-22

Family

ID=92235065

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2024/071154 WO2024169467A1 (zh) 2023-02-14 2024-01-08 Fault location method for distributed network, network device, and storage medium

Country Status (2)

Country Link
CN (1) CN118509308A (zh)
WO (1) WO2024169467A1 (zh)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
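WO2022028120A1 (zh) * 2020-08-06 2022-02-10 ZTE Corporation Indicator detection model acquisition and fault location method, apparatus, device, and storage medium
CN114710397A (zh) * 2022-04-24 2022-07-05 Industrial and Commercial Bank of China Limited Method, apparatus, electronic device, and medium for locating the fault root cause of a service link
EP4044516A1 (en) * 2019-11-29 2022-08-17 ZTE Corporation Fault locating method, apparatus and device, and storage medium
CN115514627A (zh) * 2022-09-21 2022-12-23 Sangfor Technologies Inc. Fault root cause locating method and apparatus, electronic device, and readable storage medium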

Also Published As

Publication number Publication date
CN118509308A (zh) 2024-08-16
