CN110166264B - Fault positioning method and device and electronic equipment - Google Patents

Info

Publication number
CN110166264B
Authority
CN
China
Prior art keywords
node
target
performance index
index
link
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810142390.4A
Other languages
Chinese (zh)
Other versions
CN110166264A (en)
Inventor
陈涛
刘宏伟
郭永强
王文浩
刘庆文
龚炎
崔大壮
秦强强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sankuai Online Technology Co Ltd
Original Assignee
Beijing Sankuai Online Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sankuai Online Technology Co Ltd filed Critical Beijing Sankuai Online Technology Co Ltd
Priority to CN201810142390.4A priority Critical patent/CN110166264B/en
Publication of CN110166264A publication Critical patent/CN110166264A/en
Application granted granted Critical
Publication of CN110166264B publication Critical patent/CN110166264B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0677 Localisation of faults

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The embodiment of the application provides a fault location method, a fault location device and electronic equipment. The method comprises the following steps: obtaining a target index value of a performance index corresponding to each node in a service system; calculating the abnormal evaluation score of each node based on the obtained target index value and the baseline data corresponding to each performance index; when it is judged, based on the abnormal evaluation score, that at least one target node exists in the service system, determining a node generating the fault root cause from the at least one target node based on a target node link where the at least one target node is located; the target node is a node whose abnormal evaluation score meets a preset abnormal condition. The method can quickly and effectively locate the node generating the fault root cause in the service system.

Description

Fault positioning method and device and electronic equipment
Technical Field
The embodiment of the application relates to the field of fault diagnosis, in particular to a fault positioning method and device and electronic equipment.
Background
A business system that has many functions and is deployed in a distributed manner, such as a take-away system, is characterized by complex business relations, numerous subsystems, long call chains between systems and the like. Specifically, a large number of nodes exist in the service system, and some nodes have a calling relationship, where a node may be: an interface (i.e., a particular piece of code), a service (i.e., code that implements a function, including multiple interfaces or methods), or a database, etc.
When a service system fails, the prior art locates the node generating the fault root cause by manual troubleshooting. However, during a service system failure, alarm information is lost and a large amount of useful information is submerged, so that manual fault location is slow and difficult, and the Mean Time To Recovery (MTTR) deteriorates to the minute level or even the hour level.
Therefore, how to quickly and effectively locate the node generating the fault root cause in the service system is an urgent problem to be solved.
Disclosure of Invention
In view of this, embodiments of the present application provide a fault location method, an apparatus, and an electronic device, so as to quickly and effectively locate a node generating a fault root cause in a service system.
Specifically, the embodiment of the application is realized by the following technical scheme:
In a first aspect, an embodiment of the present application provides a fault location method, including:
obtaining a target index value of a performance index corresponding to each node in a service system;
calculating the abnormal evaluation score of each node based on the obtained target index value and the baseline data corresponding to each performance index;
when it is judged, based on the abnormal evaluation score, that at least one target node exists in the service system, determining a node generating a fault root cause from the at least one target node based on a target node link where the at least one target node is located;
the target node is a node with an abnormal evaluation score meeting a preset abnormal condition.
Optionally, the step of obtaining a target index value of the performance index corresponding to each node in the service system includes:
and periodically obtaining a target index value of the performance index corresponding to each node in the service system.
Optionally, the step of calculating an anomaly evaluation score of each node based on the obtained target index value and the baseline data corresponding to each performance index includes:
aiming at each performance index, calculating an abnormal evaluation score of the performance index based on a target index value of the performance index and corresponding baseline data;
and aiming at each node, calculating the abnormal evaluation score of the node based on the abnormal evaluation scores of the performance indexes corresponding to the node.
Optionally, the step of calculating, for each performance index, an abnormality assessment score of the performance index based on the target index value of the performance index and the corresponding baseline data includes:
aiming at each performance index, calculating an abnormal evaluation score of the performance index based on a target index value of the performance index and corresponding baseline data by using a preset abnormal evaluation model;
wherein the abnormality evaluation model is:
$G_i = F_i \times f_i$
wherein $F_i = |C_i - M_i| / M_i$, $G_i$ is the abnormality evaluation score of performance index $i$, $f_i$ is the preset basic score corresponding to performance index $i$, $C_i$ is the target index value of performance index $i$, and $M_i$ is the baseline data corresponding to performance index $i$.
Optionally, the step of determining a node generating a fault root cause from the at least one target node based on a target node link where the at least one target node is located includes:
determining a target node link where the at least one target node is located;
for each target node link, when one target node exists in the target node link, determining the existing one target node as the node generating the fault root cause; when at least two target nodes exist in the target node link, the node which is located at the most downstream of the at least two target nodes is determined as the node generating the fault root cause.
Optionally, the step of determining a node generating a fault root cause from the at least one target node based on a target node link where the at least one target node is located includes:
determining a target node link where the at least one target node is located;
guiding a manager to determine a node generating a fault root cause from the at least one target node by outputting a link graph corresponding to the target node link; wherein the target node in the link graph is highlighted.
Optionally, the calculation manner of the baseline data corresponding to any performance index includes:
obtaining a historical index value corresponding to the performance index;
rejecting abnormal index values in the historical index values to obtain effective index values;
and calculating the baseline data corresponding to the performance index based on the effective index value according to the baseline calculation algorithm corresponding to the performance index.
In a second aspect, an embodiment of the present application provides a fault location device, including:
a target index value obtaining unit, configured to obtain a target index value of a performance index corresponding to each node in a service system;
the node score calculating unit is used for calculating the abnormal evaluation score of each node based on the obtained target index value and the baseline data corresponding to each performance index;
the fault positioning unit is used for determining a node generating a fault root cause from at least one target node based on a target node link where the at least one target node is located when the at least one target node is judged to exist in the service system based on the abnormal evaluation score;
the target node is a node with an abnormal evaluation score meeting a preset abnormal condition.
Optionally, the target index value obtaining unit includes:
and the index value obtaining subunit is used for periodically obtaining a target index value of the performance index corresponding to each node in the service system.
Optionally, the node score calculating unit includes:
the index score calculating subunit is used for calculating the abnormal evaluation score of each performance index based on the target index value of the performance index and the corresponding baseline data;
and the node score calculating subunit is used for calculating the abnormal evaluation score of each node based on the abnormal evaluation score of each performance index corresponding to the node aiming at each node.
Optionally, the fault location unit comprises:
a link determining subunit, configured to determine a target node link where the at least one target node is located;
the first positioning subunit is used for determining, for each target node link, when one target node exists in the target node link, the existing one target node as the node generating the fault root cause; when at least two target nodes exist in the target node link, the node which is located at the most downstream of the at least two target nodes is determined as the node generating the fault root cause.
Optionally, the fault location unit comprises:
a link determining subunit, configured to determine a target node link where the at least one target node is located;
the second positioning subunit is used for guiding a manager to determine a node generating a fault root cause from the at least one target node by outputting a link graph corresponding to the target node link; wherein the target node in the link graph is highlighted.
In a third aspect, an embodiment of the present application provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the fault location method provided in the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where the storage medium stores a computer program, and the computer program is configured to execute the fault location method provided in the first aspect.
In the method provided by the embodiment of the application, each node is subjected to anomaly scoring based on the target index value and the baseline data of the performance index corresponding to each node to obtain at least one target node with an anomaly, and the node generating the fault root cause is determined from the at least one target node based on the target node link where the at least one target node is located in consideration of the call relationship between the nodes. Therefore, the method and the device can quickly and effectively locate the node generating the fault root cause in the service system.
Drawings
Fig. 1 is a flowchart of a fault location method according to an embodiment of the present disclosure;
Fig. 2 is a schematic structural diagram of a fault location device according to an embodiment of the present disclosure;
Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present application. The word "if" as used herein may be interpreted as "upon", "when", or "in response to a determination", depending on the context.
In order to quickly and effectively locate a node generating a fault root cause in a service system, the embodiment of the application provides a fault locating method, a fault locating device and electronic equipment.
First, a fault location method provided in an embodiment of the present application is described below.
It should be noted that the execution subject of the fault location method provided in the embodiments of the present application may be a fault location device. The fault location device may run in an electronic device, and in a specific application the electronic device may be a terminal device or a server.
In addition, based on the calling relationship of each node in the service system, a plurality of node links of the service system can be constructed in advance, and the node in any node link has the calling relationship. Wherein, the node can be an interface, a service or a database, etc.; and the node links may be constructed manually, although not limited thereto. Specifically, the interface is a section of a specific code method; a service is code that implements a function and may include a number of interfaces or methods.
Furthermore, it should be emphasized that the business system according to the embodiment of the present application is not limited to a takeaway system, and any business system in which a plurality of nodes exist and a calling relationship exists between the plurality of nodes may perform fault location by using the fault location method provided in the embodiment of the present application.
As shown in fig. 1, a fault location method provided in an embodiment of the present application may include the following steps:
S101, obtaining target index values of performance indexes corresponding to all nodes in a service system;
The index value of the performance index corresponding to a node is an important factor for judging whether the node has a fault, because that index value changes when a fault occurs. Therefore, in the embodiment of the present application, the fault location device may obtain a target index value of the performance index corresponding to each node in the service system, and then perform subsequent processing based on the obtained target index value.
It can be understood that any node may correspond to one performance index or a plurality of performance indexes, the number and specific types of performance indexes corresponding to different types of nodes may be different or the same, and the number and specific types of performance indexes corresponding to the same type of nodes may be the same or different. In addition, in a specific application, the performance index corresponding to each type of node may be set according to an actual situation. For example:
the performance index corresponding to the interface may be one or more of TP99, QPM (Queries Per Minute), failure rate, exception count, and the like;
the performance index corresponding to the database may be one or more of exception count, failure rate, AVG (average time consumption), and QPM;
the performance index corresponding to the service may be one or more of exception count, failure rate, AVG, QPM, and the like;
wherein TP99 is the minimum time within which ninety-nine percent of network requests are satisfied; the failure rate is the number of call failures divided by the total number of calls; and the exception count is the number of call failures.
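The index definitions above can be illustrated with a minimal Python sketch. The function names and the input shapes (a list of latencies in milliseconds, a list of per-call success flags) are assumptions made for illustration, and the nearest-rank percentile is only one common way to compute TP99; the application does not prescribe a particular method.

```python
def tp99(latencies_ms):
    """Nearest-rank 99th percentile: the smallest latency that covers 99% of requests."""
    if not latencies_ms:
        raise ValueError("at least one latency sample is required")
    ordered = sorted(latencies_ms)
    # Integer ceiling of 0.99 * n gives the 1-based nearest rank.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def exception_count(call_succeeded):
    """Exception count: the number of failed calls."""
    return sum(1 for ok in call_succeeded if not ok)


def failure_rate(call_succeeded):
    """Failure rate: number of call failures / total number of calls."""
    total = len(call_succeeded)
    return exception_count(call_succeeded) / total if total else 0.0
```

For instance, with one hundred latency samples the sketch returns the 99th smallest value as TP99.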
In addition, it should be noted that, in a specific implementation manner, the fault location device may perform the step of obtaining a target index value of the performance index corresponding to each node in the service system upon receiving alarm information about the service system; that is, receipt of alarm information about the service system serves as the trigger condition for that step.
In order to further improve the efficiency of fault location, in another specific implementation manner, the fault location device may periodically obtain a target index value of the performance index corresponding to each node in the service system; that is, the fault location device monitors the service system, so that if the service system fails, the fault location device may complete fault location before or while receiving alarm information about the service system, thereby greatly improving the efficiency of fault location and avoiding fault location processes triggered by false alarms. It can be understood that, in a specific application, the fault location device may perform the step of obtaining the target index value once every predetermined number of minutes, that is, with minute-level periodicity; of course, the step may also be performed once every predetermined number of hours, that is, with hour-level periodicity, and the like.
Moreover, there are various specific ways for the fault location device to obtain the target index value of the performance index corresponding to each node in the service system. The fault location device may collect target index values of performance indexes from each node, for example: reading related log information of each node; or, the fault location device may collect target index values of the performance index from each node through a web crawler, where the web crawler may collect and store the index values of the performance index from each node in real time, and the fault location device reads the target index values from the stored data when fault location is needed, and so on.
S102, calculating an abnormal evaluation score of each node based on the obtained target index value and the baseline data corresponding to each performance index;
after obtaining the target index value of the performance index corresponding to each node, the fault location device may calculate the abnormal evaluation score of each node based on the obtained target index value and the baseline data corresponding to each performance index. The baseline data corresponding to any performance index is used for representing the normal level value of the performance index, and it can be understood that the actual index value of any performance index may be higher or lower than the baseline data by a certain amplitude.
Optionally, the step of calculating an anomaly evaluation score of each node based on the obtained target index value and the baseline data corresponding to each performance index may include:
aiming at each performance index, calculating an abnormal evaluation score of the performance index based on a target index value of the performance index and corresponding baseline data;
and aiming at each node, calculating the abnormal evaluation score of the node based on the abnormal evaluation scores of the performance indexes corresponding to the node.
When calculating the abnormal evaluation score of each node, the abnormal evaluation scores of the performance indexes corresponding to the node may be added to obtain the abnormal evaluation score of the node; or the abnormal evaluation scores of the performance indexes corresponding to the node can be weighted and added to obtain the abnormal evaluation score of the node. It should be noted that, the weight value used for weighting and adding the abnormality evaluation scores of the performance indicators corresponding to the node may be a value determined according to an empirical value, and is not limited herein.
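As a rough sketch of the two aggregation options just described (plain sum and weighted sum), the following Python function combines per-index scores into a node score. The name `node_score`, the dictionary layout and the default weight of 1.0 for indexes without an explicit weight are illustrative assumptions, not part of the application.

```python
def node_score(index_scores, weights=None):
    """Combine per-index abnormal evaluation scores into one node score.

    index_scores: mapping of performance index name -> abnormal evaluation score.
    weights: optional mapping of index name -> empirical weight (hypothetical);
             indexes without an entry default to a weight of 1.0.
    """
    if weights is None:
        # Plain addition of the per-index scores.
        return sum(index_scores.values())
    # Weighted addition of the per-index scores.
    return sum(weights.get(name, 1.0) * score for name, score in index_scores.items())
```

Calling `node_score({"TP99": 2.0, "failure_rate": 3.0})` adds the two scores, while passing `weights={"TP99": 0.5, "failure_rate": 2.0}` scales each score before adding.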
It is understood that the more the target index value deviates from the baseline data, the more abnormal the performance index is and hence the more abnormal the node is; that is, the degree to which the target index value deviates from the baseline data is linearly related to the degree of abnormality of the performance index and of the node. To express the degree of abnormality as a score, in the embodiment of the application a basic score is preset for each performance index, and the abnormal evaluation scores of the performance index and of the node are then calculated from two elements: the basic score and the degree to which the target index value deviates from the baseline data. Specifically, the step of calculating the abnormality evaluation score of each performance index based on the target index value of the performance index and the corresponding baseline data may include:
aiming at each performance index, calculating an abnormal evaluation score of the performance index based on a target index value of the performance index and corresponding baseline data by using a preset abnormal evaluation model;
wherein the abnormality evaluation model is as follows:
$G_i = F_i \times f_i$
wherein $F_i = |C_i - M_i| / M_i$, $G_i$ is the abnormality evaluation score of performance index $i$, $f_i$ is the preset basic score corresponding to performance index $i$, $C_i$ is the target index value of performance index $i$, and $M_i$ is the baseline data corresponding to performance index $i$.
It can be understood that the preset basic score corresponding to any performance index may be set according to specific situations, which is not limited in the present application.
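The abnormality evaluation model can be written out directly in code. The sketch below follows the formula as stated (relative deviation from the baseline multiplied by the basic score); the function and parameter names are illustrative, and rejecting a zero baseline is an assumption added here because the formula divides by the baseline.

```python
def index_anomaly_score(target_value, baseline, base_score):
    """G_i = F_i * f_i with F_i = |C_i - M_i| / M_i.

    target_value: C_i, the target index value.
    baseline:     M_i, the baseline data for the index.
    base_score:   f_i, the preset basic score for the index.
    """
    if baseline == 0:
        # Assumption: the model is undefined for a zero baseline, so fail loudly.
        raise ValueError("baseline must be non-zero")
    deviation = abs(target_value - baseline) / baseline  # F_i
    return deviation * base_score                        # G_i
```

For example, a target value of 150 against a baseline of 100 with a basic score of 10 deviates by 50%, which yields a score of 5.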
It should be noted that the baseline data corresponding to any performance index may be fixed or may be updated periodically. In addition, the baseline data corresponding to any one of the performance indicators may be set based on empirical values. Of course, the baseline data corresponding to any performance index may also be calculated by the fault location device according to historical data, and specifically, the calculation manner of the baseline data corresponding to any performance index may include:
obtaining a historical index value corresponding to the performance index;
rejecting abnormal index values in the historical index values to obtain effective index values;
and calculating the baseline data corresponding to the performance index based on the effective index value according to the baseline calculation algorithm corresponding to the performance index.
It should be emphasized that the abnormal index value elimination algorithms corresponding to different performance indexes may differ, and similarly the baseline calculation algorithms corresponding to different performance indexes may differ. For a performance index without a periodic pattern, the abnormal data processing algorithm used to eliminate abnormal index values may be the MAD (Mean Absolute Deviation) algorithm, and the baseline calculation algorithm used to calculate the baseline data may be, but is not limited to, a quartile algorithm. For a performance index with a periodic pattern, the abnormal data processing algorithm used to eliminate abnormal index values may likewise be the MAD algorithm, and the baseline calculation algorithm used to calculate the baseline data may be, but is not limited to, a K σ deviation algorithm. For example, the index value of TP99 follows no specific pattern, so abnormal index values can be eliminated using the MAD algorithm and the baseline data calculated using the quartile method; the index value distribution of QPM is periodic, so abnormal index values can be eliminated using the MAD algorithm and the baseline data calculated using the K σ deviation algorithm. Here, MAD (Mean Absolute Deviation) is also referred to as the mean deviation. Specifically, when the population contains N units with values $X_1, X_2, X_3, \ldots, X_{N-1}, X_N$, the difference between each value and the population mean is called the deviation, and the mean absolute deviation is defined as the mean of the absolute values of the deviations of the data from the mean. The quartile algorithm, also known as the boxplot method, describes the overall distribution of the data using statistics such as the first quartile, the median and the third quartile, and calculates upper and lower boundary values of the data from these statistics to serve as baseline values.
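The MAD-based elimination and the quartile baseline can be sketched as follows. The text defines the mean absolute deviation but does not state the rejection rule or the boundary formula, so two assumptions are made here and labeled in the code: a value is rejected when its deviation from the mean exceeds `k` times the MAD, and the boundaries use the conventional boxplot whiskers of 1.5 times the interquartile range. All function names are hypothetical.

```python
from statistics import mean, quantiles


def mean_absolute_deviation(values):
    """Mean of the absolute deviations of the values from their mean."""
    center = mean(values)
    return mean([abs(v - center) for v in values])


def reject_outliers(values, k=3.0):
    """Keep values within k * MAD of the mean (k is an assumed cut-off)."""
    center = mean(values)
    mad = mean_absolute_deviation(values)
    if mad == 0:
        return list(values)  # all values identical, nothing to reject
    return [v for v in values if abs(v - center) <= k * mad]


def quartile_baseline(values, whisker=1.5):
    """Boxplot-style baseline; the 1.5 * IQR whisker is an assumed convention."""
    q1, median, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return {"lower": q1 - whisker * iqr, "median": median, "upper": q3 + whisker * iqr}
```

A typical use would be `quartile_baseline(reject_outliers(history))`, where `history` holds the historical index values for one performance index such as TP99.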
For the K σ deviation algorithm, K may be set according to the scenario. Taking K = 3 as an example, the algorithm is as follows: first, a set of detection data is assumed to follow a normal or approximately normal distribution; the mean and standard deviation of the detection data are then calculated using statistical methods; and the baseline value is determined according to the distribution characteristics of the normal distribution.
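A minimal sketch of the K σ deviation idea follows, assuming that the baseline is expressed as the band from mean minus K standard deviations to mean plus K standard deviations. The text only says the baseline is derived from the normal distribution's characteristics, so this band, the use of the population standard deviation, and the function name are assumptions.

```python
from statistics import mean, pstdev


def k_sigma_baseline(values, k=3.0):
    """Baseline band mean +/- k * sigma, assuming roughly normal data."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumed choice)
    return {"lower": mu - k * sigma, "mean": mu, "upper": mu + k * sigma}
```

Under a normal distribution, K = 3 corresponds to a band that contains about 99.7% of the values, which is why index values outside the band are reasonable candidates for being treated as abnormal.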
In addition, it is understood that whether the baseline data is set based on an empirical value or calculated based on historical data may be determined based on the index value distribution of the performance index.
It should be emphasized that the specific implementation manner of calculating the abnormal evaluation score of each node based on the obtained target index value and the baseline data corresponding to each performance index is given as an example only, and should not be construed as a limitation to the embodiment of the present application.
S103, when judging that at least one target node exists in the service system based on the abnormal evaluation score, determining a node generating a fault root cause from the at least one target node based on a target node link where the at least one target node is located.
The target node is a node with an abnormal evaluation score meeting a preset abnormal condition.
After the abnormal evaluation scores of the nodes are calculated, whether the abnormal evaluation score of each node meets the preset abnormal condition can be judged, so as to obtain the target nodes; because a target node meets the preset abnormal condition, it is regarded as a node with a fault. Furthermore, when it is determined that at least one target node exists in the service system based on the abnormal evaluation score, it indicates that the service system has failed, and therefore the fault location device determines a node generating a fault root cause from the at least one target node based on a target node link where the at least one target node is located. The preset abnormal condition may be, but is not limited to, that the score is above a predetermined score threshold; the predetermined score threshold may be 0 or a score value greater than 0, set according to the actual situation.
Optionally, in a specific implementation manner, the step of determining, from the at least one target node, a node generating a fault root cause based on a target node link where the at least one target node is located may include:
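The threshold form of the preset abnormal condition can be sketched in a few lines. The function name and the dictionary of node scores are illustrative assumptions; the strict "greater than" comparison mirrors the "above a predetermined score threshold" wording.

```python
def find_target_nodes(node_scores, threshold=0.0):
    """Return the nodes whose abnormal evaluation score is above the threshold."""
    return [node for node, score in node_scores.items() if score > threshold]
```

An empty result means no target node exists, in which case no root-cause determination is needed for that round.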
determining a target node link where the at least one target node is located;
for each target node link, when one target node exists in the target node link, determining the existing one target node as the node generating the fault root cause; when at least two target nodes exist in the target node link, the node which is located at the most downstream of the at least two target nodes is determined as the node generating the fault root cause.
It is emphasized that, for a node link, the most downstream node among a plurality of nodes is the node whose own changes can affect the other nodes in the plurality of nodes, for example the node at the lowest layer of the call chain among the plurality of nodes. For example, given a plurality of nodes A, B, C and D, if A is called by B, C and D, then A is the most downstream node among the plurality of nodes.
In addition, since one node may be located in multiple node links, when determining a target node link where the at least one target node is located, a target node link where the at least one target node is located and belongs to a core link may be determined, where the so-called core link is a link that is concerned by a manager or a link that is important in a service system, where the core link may be set manually or may be analyzed and set by the system itself.
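The per-link rule described above (a single target node is the root cause; with several, the most downstream one is) can be sketched as follows. The representation is an assumption: each node link is a list ordered from the most upstream caller to the most downstream callee, and `core_links` is a hypothetical optional set of link names used to restrict the search to core links.

```python
def locate_root_causes(node_links, target_nodes, core_links=None):
    """Map each affected link name to the node judged to generate the fault root cause.

    node_links: mapping of link name -> nodes ordered upstream to downstream (assumed layout).
    target_nodes: nodes whose abnormal evaluation score met the abnormal condition.
    core_links: optional set of link names to consider; None means all links.
    """
    targets = set(target_nodes)
    roots = {}
    for name, nodes in node_links.items():
        if core_links is not None and name not in core_links:
            continue
        on_link = [node for node in nodes if node in targets]
        if on_link:
            # With one target node this is that node; with several it is
            # the most downstream one, because the list is ordered that way.
            roots[name] = on_link[-1]
    return roots
```

Because a node may sit on several links, the same node can appear as the root cause for more than one link, or as the root cause on one link while a more downstream node is chosen on another.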
Optionally, in another specific implementation manner, the step of determining, from the at least one target node, a node generating a fault root cause based on a target node link where the at least one target node is located may include:
determining a target node link where the at least one target node is located;
guiding a manager to determine a node generating a fault root cause from the at least one target node by outputting a link graph corresponding to the target node link; wherein the target node in the link graph is highlighted.
The target node may be highlighted by color, but is not limited thereto. Moreover, when the target node is highlighted by color, the icons of the respective target nodes may have the same color or different colors on the premise of being different from the colors of the icons of other normal nodes. It is to be understood that, in a specific application, in order to achieve a better distinguishing effect and to know the degree of abnormality, the color of the icon may be set to red for a target node whose abnormality evaluation score is higher than a predetermined value, and the color of the icon may be set to yellow for a target node whose abnormality evaluation score is lower than a predetermined value, so that a manager can know the degree of abnormality of the target node through the degree of prominence of the color.
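The red and yellow highlighting described above amounts to a small mapping from score to icon color. In this sketch the function name, the "default" color for normal nodes, and the treatment of a score exactly equal to the predetermined value (shown as yellow) are assumptions, since the text does not specify them.

```python
def icon_color(score, is_target, severe_threshold):
    """Pick an icon color for a node in the link graph."""
    if not is_target:
        return "default"  # normal nodes keep the regular icon color (assumed name)
    # Scores above the predetermined value are shown in red, others in yellow.
    return "red" if score > severe_threshold else "yellow"
```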
The fault locator may directly output the link map, or may output link information corresponding to the link map, but is not limited thereto. When link information is output, the manager can click the link information to enter the display interface of the link map and view the link map there.
In the method provided by the embodiment of the application, each node is subjected to anomaly scoring based on the target index value and the baseline data of the performance index corresponding to each node to obtain at least one target node with an anomaly, and the node generating the fault root cause is determined from the at least one target node based on the target node link where the at least one target node is located in consideration of the call relationship between the nodes. Therefore, the method and the device can quickly and effectively locate the node generating the fault root cause in the service system.
Corresponding to the method embodiment, the embodiment of the application also provides a fault positioning device. As shown in fig. 2, the fault locating device may include:
a target index value obtaining unit 210, configured to obtain a target index value of a performance index corresponding to each node in a service system;
a node score calculating unit 220, configured to calculate an abnormal evaluation score of each node based on the obtained target index value and baseline data corresponding to each performance index;
a fault locating unit 230, configured to, when it is determined that at least one target node exists in the service system based on the abnormal evaluation score, determine a node generating a fault root cause from the at least one target node based on a target node link where the at least one target node is located;
the target node is a node with an abnormal evaluation score meeting a preset abnormal condition.
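The baseline data used by the node score calculating unit 220 is obtained from historical index values after abnormal index values have been rejected. The following is a minimal sketch of that preparation step. The interquartile-range rejection rule, the use of the arithmetic mean as the baseline calculation algorithm, and the function names are all assumptions, since the application leaves the concrete algorithm per performance index open.

```python
from statistics import mean, quantiles
from typing import List, Sequence


def reject_abnormal(history: Sequence[float], k: float = 1.5) -> List[float]:
    """Drop historical values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed rule)."""
    if len(history) < 4:
        # Too few samples to estimate quartiles meaningfully; keep everything.
        return list(history)
    q1, _, q3 = quantiles(history, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [value for value in history if low <= value <= high]


def baseline(history: Sequence[float]) -> float:
    """Baseline of one performance index: mean of the effective (valid) values."""
    valid = reject_abnormal(history)
    return mean(valid)


# Hypothetical response times in milliseconds with one abnormal spike.
history_ms = [100, 102, 98, 101, 99, 500]
base = baseline(history_ms)
```

An empty history makes `mean` raise `statistics.StatisticsError`, so a caller would need to decide how to treat an index without history.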
The device provided by the embodiment of the application scores each node for abnormality based on the target index value and the baseline data of the performance index corresponding to each node to obtain at least one abnormal target node and, taking the calling relationship between the nodes into account, determines the node generating the fault root cause from the at least one target node based on the target node link where the at least one target node is located. Therefore, the device can quickly and effectively locate the node generating the fault root cause in the service system.
Optionally, the target index value obtaining unit 210 may include:
an index value obtaining subunit, configured to periodically obtain a target index value of the performance index corresponding to each node in the service system.
Optionally, the node score calculating unit 220 may include:
an index score calculating subunit, configured to calculate, for each performance index, the abnormal evaluation score of the performance index based on the target index value of the performance index and the corresponding baseline data;
a node score calculating subunit, configured to calculate, for each node, the abnormal evaluation score of the node based on the abnormal evaluation scores of the performance indexes corresponding to the node.
Optionally, the index score calculating subunit is specifically configured to:
for each performance index, calculate the abnormal evaluation score of the performance index based on the target index value of the performance index and the corresponding baseline data by using a preset abnormal evaluation model;
wherein the abnormality evaluation model is:
$$G_i = F_i \cdot f_i$$
wherein $F_i = |C_i - M_i| / M_i$, $G_i$ is the abnormal evaluation score of the performance index $i$, $f_i$ is a preset basic score corresponding to the performance index $i$, $C_i$ is the target index value of the performance index $i$, and $M_i$ is the baseline data corresponding to the performance index $i$.
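The abnormal evaluation model can be written out as a short sketch. The per-index score follows the model above, with the relative deviation Fi multiplied by the basic score fi. The aggregation of index scores into a node score by summation, the requirement that the baseline be positive, and the function names are assumptions made for illustration.

```python
from typing import Iterable, Tuple


def index_score(target: float, base: float, basic_score: float) -> float:
    """G_i = F_i * f_i with F_i = |C_i - M_i| / M_i."""
    if base <= 0:
        # Assumption: a baseline must be positive for the ratio to be meaningful.
        raise ValueError("baseline data must be positive")
    deviation = abs(target - base) / base  # F_i
    return deviation * basic_score         # G_i


def node_score(indexes: Iterable[Tuple[float, float, float]]) -> float:
    """Sum the index scores of one node (summation is an assumed aggregation).

    Each item is (target index value C_i, baseline M_i, basic score f_i).
    """
    return sum(index_score(c, m, f) for c, m, f in indexes)


# Hypothetical node with two performance indexes.
latency_score = index_score(target=150, base=100, basic_score=10)
total = node_score([(150, 100, 10), (80, 100, 20)])
```

Because Fi is a relative deviation, indexes with very different units can share one scale, and the basic score fi then acts as a per-index weight.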
Optionally, the fault location unit 230 may include:
a link determining subunit, configured to determine a target node link where the at least one target node is located;
a first positioning subunit, configured to, for each target node link: when one target node exists in the target node link, determine that target node as the node generating the fault root cause; and when at least two target nodes exist in the target node link, determine the most downstream of the at least two target nodes as the node generating the fault root cause.
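The rule applied by the first positioning subunit can be sketched as follows. This is a minimal sketch, assuming each target node link is represented as a list ordered from the most upstream caller to the most downstream callee; the function name `locate_root_causes` and the node names are hypothetical.

```python
from typing import Dict, Sequence, Set


def locate_root_causes(links: Sequence[Sequence[str]], target_nodes: Set[str]) -> Dict[int, str]:
    """Map the index of each link to the node taken as its fault root cause.

    Links that contain no target node are not target node links and are skipped.
    """
    root_causes: Dict[int, str] = {}
    for index, link in enumerate(links):
        hits = [node for node in link if node in target_nodes]
        if hits:
            # With one target node this is that node; with several it is the
            # last one in upstream-to-downstream order, i.e. the most downstream.
            root_causes[index] = hits[-1]
    return root_causes


links = [
    ["gateway", "order", "payment", "db"],
    ["gateway", "search"],
    ["gateway", "profile"],
]
found = locate_root_causes(links, target_nodes={"order", "db", "search"})
```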
Optionally, the fault location unit 230 may include:
a link determining subunit, configured to determine a target node link where the at least one target node is located;
a second positioning subunit, configured to guide a manager to determine a node generating a fault root cause from the at least one target node by outputting a link map corresponding to the target node link; wherein the target node in the link map is highlighted.
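One possible way to output such a link map is to emit a graph description that a drawing tool can render. The sketch below produces Graphviz DOT text with target nodes filled in a highlight color. DOT as the output format, the function name `link_map_dot`, and the upstream-to-downstream list representation of a link are assumptions and are not specified by the application.

```python
from typing import Sequence, Set


def link_map_dot(link: Sequence[str], target_nodes: Set[str], highlight: str = "red") -> str:
    """Render one node link as DOT text, highlighting the target nodes."""
    lines = ["digraph link_map {"]
    for node in link:
        if node in target_nodes:
            lines.append(f'  "{node}" [style=filled, fillcolor={highlight}];')
        else:
            lines.append(f'  "{node}";')
    # Draw one edge per caller -> callee pair along the link.
    for caller, callee in zip(link, link[1:]):
        lines.append(f'  "{caller}" -> "{callee}";')
    lines.append("}")
    return "\n".join(lines)


dot_text = link_map_dot(["gateway", "order", "db"], target_nodes={"db"})
```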
In addition, corresponding to the fault positioning method, the embodiment of the application also provides an electronic device. Referring to fig. 3, at the hardware level, the electronic device includes a processor 310, an internal bus 320, a network interface 330, a memory 340, and a non-volatile memory 350, and may also include hardware required for other services. The processor 310 reads the corresponding computer program from the non-volatile memory 350 into the memory 340 and then runs it to execute the fault positioning method provided by the present application, forming a fault positioning device at the logical level. Of course, besides the software implementation, the present application does not exclude other implementations, such as logic devices or a combination of software and hardware; that is, the execution subject of the processing flow is not limited to logic units and may also be hardware or a logic device.
In addition, an embodiment of the present application further provides a computer-readable storage medium, where the storage medium stores a computer program, and the computer program is used to execute the above fault location method.
The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the application. One of ordinary skill in the art can understand and implement it without inventive effort.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the scope of protection of the present application.

Claims (11)

1. A method of fault location, comprising:
obtaining a target index value of a performance index corresponding to each node in a service system;
calculating the abnormal evaluation score of each node based on the obtained target index value and the baseline data corresponding to each performance index;
when it is determined, based on the abnormal evaluation scores, that at least one target node exists in the service system, determining a target node link where the at least one target node is located, and for each target node link: when one target node exists in the target node link, determining that target node as the node generating the fault root cause; and when at least two target nodes exist in the target node link, determining the most downstream of the at least two target nodes as the node generating the fault root cause;
the target node is a node with an abnormal evaluation score meeting a preset abnormal condition;
the calculation mode of the baseline data corresponding to any performance index comprises the following steps:
obtaining a historical index value corresponding to the performance index;
rejecting abnormal index values in the historical index values to obtain effective index values;
and calculating the baseline data corresponding to the performance index based on the effective index value according to the baseline calculation algorithm corresponding to the performance index.
2. The method of claim 1, wherein the step of obtaining the target index value of the performance index corresponding to each node in the service system comprises:
and periodically obtaining a target index value of the performance index corresponding to each node in the service system.
3. The method according to claim 1 or 2, wherein the step of calculating the abnormal evaluation score of each node based on the obtained target index value and the baseline data corresponding to each performance index comprises:
for each performance index, calculating an abnormal evaluation score of the performance index based on a target index value of the performance index and corresponding baseline data;
and for each node, calculating the abnormal evaluation score of the node based on the abnormal evaluation scores of the performance indexes corresponding to the node.
4. The method of claim 3, wherein the step of calculating, for each performance index, an abnormal evaluation score of the performance index based on the target index value of the performance index and the corresponding baseline data comprises:
for each performance index, calculating an abnormal evaluation score of the performance index based on a target index value of the performance index and corresponding baseline data by using a preset abnormal evaluation model;
wherein the abnormality evaluation model is:
$$G_i = F_i \cdot f_i$$
wherein $F_i = |C_i - M_i| / M_i$, $G_i$ is the abnormal evaluation score of the performance index $i$, $f_i$ is a preset basic score corresponding to the performance index $i$, $C_i$ is the target index value of the performance index $i$, and $M_i$ is the baseline data corresponding to the performance index $i$.
5. The method according to claim 1 or 2, characterized in that the method further comprises:
determining a target node link where the at least one target node is located;
guiding a manager to determine a node generating a fault root cause from the at least one target node by outputting a link map corresponding to the target node link; wherein the target node in the link map is highlighted.
6. A fault locating device, comprising:
a target index value obtaining unit, configured to obtain a target index value of a performance index corresponding to each node in a service system;
the node score calculating unit is used for calculating the abnormal evaluation score of each node based on the obtained target index value and the baseline data corresponding to each performance index, and the calculating mode of the baseline data corresponding to any performance index comprises the following steps: obtaining a historical index value corresponding to the performance index, eliminating abnormal index values in the historical index value to obtain an effective index value, and calculating baseline data corresponding to the performance index based on the effective index value according to a baseline calculation algorithm corresponding to the performance index;
the fault positioning unit comprises a link determining subunit and a first positioning subunit, and is used for determining a node generating a fault root cause from at least one target node based on a target node link where the at least one target node is located when the at least one target node is judged to exist in the service system based on the abnormal evaluation score;
a link determining subunit, configured to determine a target node link where the at least one target node is located;
the first positioning subunit is configured to, for each target node link: when one target node exists in the target node link, determine that target node as the node generating the fault root cause; and when at least two target nodes exist in the target node link, determine the most downstream of the at least two target nodes as the node generating the fault root cause;
the target node is a node with an abnormal evaluation score meeting a preset abnormal condition.
7. The apparatus according to claim 6, wherein the target index value obtaining unit comprises:
an index value obtaining subunit, configured to periodically obtain a target index value of the performance index corresponding to each node in the service system.
8. The apparatus according to claim 6 or 7, wherein the node score calculating unit includes:
an index score calculating subunit, configured to calculate, for each performance index, the abnormal evaluation score of the performance index based on the target index value of the performance index and the corresponding baseline data;
and a node score calculating subunit, configured to calculate, for each node, the abnormal evaluation score of the node based on the abnormal evaluation scores of the performance indexes corresponding to the node.
9. The apparatus according to claim 6 or 7, wherein the fault positioning unit further comprises:
a link determining subunit, configured to determine a target node link where the at least one target node is located;
a second positioning subunit, configured to guide a manager to determine a node generating a fault root cause from the at least one target node by outputting a link map corresponding to the target node link; wherein the target node in the link map is highlighted.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the fault localization method of any of the preceding claims 1-5 when executing the program.
11. A computer-readable storage medium, characterized in that the storage medium stores a computer program for executing the fault localization method of any one of the preceding claims 1-5.
CN201810142390.4A 2018-02-11 2018-02-11 Fault positioning method and device and electronic equipment Active CN110166264B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810142390.4A CN110166264B (en) 2018-02-11 2018-02-11 Fault positioning method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN110166264A CN110166264A (en) 2019-08-23
CN110166264B CN110166264B (en) 2022-03-08

Family

ID=67635085

Country Status (1)

Country Link
CN (1) CN110166264B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant