CN114461434A - Fault root cause analysis method, device, electronic equipment and medium - Google Patents

Publication number
CN114461434A
Authority
CN
China
Prior art keywords
node
alarm information
link structure
root cause
detected
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210131335.1A
Other languages
Chinese (zh)
Inventor
程鹏
白佳乐
任政
郑凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202210131335.1A
Publication of CN114461434A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0709 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The disclosure provides a fault root cause analysis method and relates to the technical field of artificial intelligence. The method comprises: acquiring a link structure; constructing a fault rule tree according to the link structure, wherein the fault rule tree represents the association relationship among the alarm information generated by the nodes in the link structure; acquiring a plurality of indexes to be detected for each node in the link structure; acquiring first alarm information; acquiring an alarm information set, comprising a plurality of pieces of historical alarm information, sent by a plurality of nodes in the link structure within a specified time window; determining at least one node in the link structure as a root cause node corresponding to the first alarm information according to the plurality of pieces of historical alarm information and the fault rule tree; and calculating the deviation degree of each index to be detected of the root cause node by using a multivariate time series detection method, and determining the index to be detected with the highest deviation degree as the root cause index of the root cause node. The present disclosure also provides a fault root cause analysis apparatus, a device, a storage medium, and a program product.

Description

Fault root cause analysis method, device, electronic equipment and medium
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a fault root cause analysis method, apparatus, electronic device, and medium.
Background
In the era of distributed cloud computing, application call links are becoming increasingly complex: the number of nodes keeps growing, and the calling relationships among nodes are increasingly intricate. When a link or a node fails, multiple nodes often raise alarms at the same time. Identifying which index of which node has failed is therefore particularly important for locating and resolving the fault, and directly affects the availability and stability of the application.
At present, root cause localization in fault analysis relies mainly on manual analysis by development and operation and maintenance personnel, who must find the root cause of a fault or alarm in a large amount of data. This approach requires people to search numerous files and large volumes of data for faint traces of the fault; it depends on expert experience and is time-consuming and labor-intensive.
Disclosure of Invention
In view of the foregoing, the present disclosure provides a fault root cause analysis method, apparatus, electronic device, and medium that improve root cause localization efficiency and fault index troubleshooting capability.
According to a first aspect of the present disclosure, there is provided a fault root cause analysis method, including: obtaining a link structure, wherein the link structure comprises call links among a plurality of nodes, and at least one node in the link structure generates alarm information corresponding to a fault event; constructing a fault rule tree according to the link structure, wherein the fault rule tree represents the association relationship among the alarm information generated by the nodes in the link structure; acquiring a plurality of indexes to be detected for each node in the link structure; acquiring first alarm information, wherein the first alarm information is sent by at least one first node in the link structure at a first time; acquiring an alarm information set sent by a plurality of nodes in the link structure within a specified time window, wherein the specified time window is a specified time period before the first time, and the alarm information set comprises a plurality of pieces of historical alarm information sent by the plurality of nodes within the specified time window; determining at least one node in the link structure as a root cause node corresponding to the first alarm information according to the plurality of pieces of historical alarm information and the fault rule tree; and calculating the deviation degree of each index to be detected of the root cause node by using a multivariate time series detection method, and determining the index to be detected with the highest deviation degree as the root cause index of the root cause node.
According to an embodiment of the present disclosure, the plurality of nodes include an upstream node and a downstream node, and the link structure includes a call link in which the upstream node calls the downstream node; constructing the fault rule tree according to the link structure specifically includes: acquiring upstream alarm information of the upstream node; and causing, according to the call link in which the upstream node calls the downstream node, the downstream node to generate downstream alarm information, wherein the downstream alarm information and the upstream alarm information are of the same alarm type.
According to an embodiment of the present disclosure, determining, according to the plurality of pieces of historical alarm information and the fault rule tree, at least one node in the link structure as a root cause node corresponding to the first alarm information specifically includes: screening effective alarm information from the plurality of pieces of historical alarm information according to the fault rule tree, wherein there are a plurality of pieces of effective alarm information, each having the same alarm type as the first alarm information; acquiring a plurality of second nodes corresponding to the effective alarm information; and determining the most upstream node among the plurality of second nodes as the root cause node corresponding to the first alarm information.
According to an embodiment of the present disclosure, calculating the deviation degree of each index to be detected of the root cause node by using the multivariate time series detection method specifically includes: normalizing each index to be detected of the root cause node; and determining the deviation degree of each index to be detected according to the standard deviation of the normalized index to be detected.
According to an embodiment of the present disclosure, at least one of the plurality of nodes runs in a container, and the indexes to be detected include at least one of container CPU utilization, container memory utilization, container I/O utilization, network connection number, access time consumption, and access success rate.
According to an embodiment of the present disclosure, when the plurality of indexes to be detected include the network connection number and the access time consumption, determining the index to be detected with the highest deviation degree as the root cause index of the root cause node specifically includes: comparing the deviation degrees of the network connection number and of the access time consumption of the root cause node with their respective preset lower limit thresholds; if the deviation degrees of the network connection number and of the access time consumption are both greater than their lower limit thresholds, determining that the network is abnormal; and determining whichever of the network connection number and the access time consumption has the higher deviation degree as the root cause index of the root cause node.
According to an embodiment of the present disclosure, the method further comprises: sorting the indexes to be detected of the root cause node, other than the root cause index, according to deviation degree; and sending the other indexes to be detected, sorted by deviation degree.
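The network check and the ranking of the remaining indexes described in the two preceding paragraphs can be sketched as follows. This is a minimal illustration only: the function name `pick_root_cause_index`, the dictionary keys `network_connections` and `access_time`, and the fallback to the overall highest deviation degree when the network is not judged abnormal are assumptions made for the example, not details given in the disclosure.

```python
def pick_root_cause_index(degrees, lower_limits):
    """Choose a root cause index from per-index deviation degrees.

    degrees:      index name -> deviation degree (hypothetical key names).
    lower_limits: preset lower limit thresholds for the two network indexes.
    """
    network_keys = ("network_connections", "access_time")
    # The network is judged abnormal only if BOTH deviation degrees
    # exceed their respective lower limit thresholds.
    network_abnormal = all(degrees[k] > lower_limits[k] for k in network_keys)
    if network_abnormal:
        # Pick whichever of the two network indexes deviates more.
        root = max(network_keys, key=degrees.get)
    else:
        # Assumed fallback: the index with the highest deviation degree overall.
        root = max(degrees, key=degrees.get)
    # Remaining indexes, sorted by deviation degree in descending order.
    others = sorted((k for k in degrees if k != root),
                    key=degrees.get, reverse=True)
    return root, network_abnormal, others


degrees = {"cpu": 5.0, "network_connections": 3.0,
           "access_time": 4.0, "memory": 1.0}
limits = {"network_connections": 2.0, "access_time": 2.0}
root, abnormal, others = pick_root_cause_index(degrees, limits)
```

In this example both network indexes exceed their thresholds, so the sketch reports the network as abnormal and selects `access_time`, with the remaining indexes returned in descending order of deviation degree.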
A second aspect of the present disclosure provides a fault root cause analysis device including: a link obtaining module, configured to obtain a link structure, where the link structure includes call links among a plurality of nodes, and at least one node in the link structure generates alarm information corresponding to a fault event; a rule tree construction module, configured to construct a fault rule tree according to the link structure, where the fault rule tree represents the association relationship among the alarm information generated by the nodes in the link structure; a to-be-detected index obtaining module, configured to obtain a plurality of indexes to be detected for each node in the link structure; an alarm information obtaining module, configured to obtain first alarm information, where the first alarm information is sent by at least one first node in the link structure at a first time; a historical alarm information obtaining module, configured to obtain an alarm information set sent by a plurality of nodes in the link structure within a specified time window, where the specified time window is a specified time period before the first time, and the alarm information set includes a plurality of pieces of historical alarm information sent by the plurality of nodes within the specified time window; a root cause node determining module, configured to determine, according to the plurality of pieces of historical alarm information and the fault rule tree, at least one node in the link structure as a root cause node corresponding to the first alarm information; and a root cause index determining module, configured to calculate the deviation degree of each index to be detected of the root cause node by using a multivariate time series detection method, and to determine the index to be detected with the highest deviation degree as the root cause index of the root cause node.
A third aspect of the present disclosure provides an electronic device, comprising: one or more processors; a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the fault root cause analysis method described above.
A fourth aspect of the present disclosure also provides a computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform the above fault root cause analysis method.
A fifth aspect of the present disclosure also provides a computer program product comprising a computer program which, when executed by a processor, implements the above fault root cause analysis method.
Compared with the prior art, the fault root cause analysis method, device, electronic equipment, and medium provided by the present disclosure have the following beneficial effects:
The fault root cause analysis method combines a fault rule tree with a multivariate time series detection algorithm. Because the fault rule tree is built offline in advance, node localization is fast and accurate. For index localization, the multivariate time series detection algorithm evaluates the deviation degrees of the indexes together, which gives high accuracy and timely detection.
Drawings
The foregoing and other objects, features and advantages of the disclosure will be apparent from the following description of embodiments of the disclosure, which proceeds with reference to the accompanying drawings, in which:
FIG. 1 schematically illustrates a system architecture of a fault root cause analysis method and apparatus according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow chart of a fault root cause analysis method according to an embodiment of the present disclosure;
FIG. 3 schematically illustrates a flow diagram of fault rule tree construction according to an embodiment of the present disclosure;
FIG. 4 schematically illustrates a flow diagram of root cause node determination according to an embodiment of the disclosure;
FIG. 5 schematically illustrates a flow chart of a degree of deviation calculation according to an embodiment of the present disclosure;
FIG. 6 schematically illustrates a flow diagram of root cause indicator determination for a root cause node according to an embodiment of the present disclosure;
FIG. 7 schematically illustrates a flow chart of a fault root cause analysis method according to another embodiment of the present disclosure;
FIG. 8 schematically shows a block diagram of the structure of a fault root cause analysis apparatus according to an embodiment of the present disclosure;
FIG. 9 schematically illustrates a block diagram of an electronic device adapted to implement a fault root cause analysis method according to an embodiment of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is illustrative only and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.
Where a convention analogous to "at least one of A, B and C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B and C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.).
The embodiments of the disclosure provide a fault root cause analysis method, device, electronic equipment, and medium, and relate to the technical field of artificial intelligence. The method comprises: obtaining a link structure, wherein the link structure comprises call links among a plurality of nodes, and at least one node in the link structure generates alarm information corresponding to a fault event; constructing a fault rule tree according to the link structure, wherein the fault rule tree represents the association relationship among the alarm information generated by the nodes in the link structure; acquiring a plurality of indexes to be detected for each node in the link structure; acquiring first alarm information, wherein the first alarm information is sent by at least one first node in the link structure at a first time; acquiring an alarm information set sent by a plurality of nodes in the link structure within a specified time window, wherein the specified time window is a specified time period before the first time, and the alarm information set comprises a plurality of pieces of historical alarm information sent by the plurality of nodes within the specified time window; determining at least one node in the link structure as a root cause node corresponding to the first alarm information according to the plurality of pieces of historical alarm information and the fault rule tree; and calculating the deviation degree of each index to be detected of the root cause node by using a multivariate time series detection method, and determining the index to be detected with the highest deviation degree as the root cause index of the root cause node.
With the method and device of the present disclosure, when an alarm occurs, the root cause node is located from the alarm information set collected over a reference time window on the link structure. Once the root cause node is found, multivariate time series detection is performed on each index to be detected defined for that node, so that the specific index responsible for the fault can be identified. Alarm rules and node index detection are thus considered together, making root cause analysis simple and timely.
Before describing in detail specific embodiments of the present disclosure, technical terms are first explained to facilitate a better understanding of the present disclosure.
Time series: a sequence of numbers formed by arranging the values of a statistical index in chronological order. Time series analysis typically predicts the future trend of the series from the data it already contains.
Prophet algorithm: a time series prediction algorithm open-sourced by Facebook. It is based on a decomposable model (trend, seasonality, and holidays), supports custom seasonality and holiday effects, and offers more flexible parameter configuration than the Holt-Winters (triple exponential smoothing) and ARIMA algorithms.
Rule tree: a classification method based on human experience. Rules are organized in a tree structure, which makes rule configuration flexible.
Fig. 1 schematically illustrates a system architecture 100 suitable for a fault root cause analysis method and apparatus according to an embodiment of the present disclosure. It should be noted that fig. 1 is only an example of a system architecture to which the embodiments of the present disclosure may be applied to help those skilled in the art understand the technical content of the present disclosure, and does not mean that the embodiments of the present disclosure may not be applied to other devices, systems, environments or scenarios.
As shown in FIG. 1, the system architecture 100 according to this embodiment may include access terminals 101, 102, a user-side server 103, an intermediate server 104, and an overall server 105. These node devices are connected through a network to form a distributed system architecture. The network may include various connection types, such as wired or wireless communication links, or fiber optic cables.
The access terminals 101, 102 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The user-side server 103, the intermediate server 104, and the overall server 105 may each be a server that provides various services, such as a background management server (for example only) that supports websites browsed by users with the access terminals 101, 102. The background management server may analyze and otherwise process received data such as user requests, and feed the processing result (for example, a webpage, information, or data obtained or generated according to the user request) back to the terminal device node by node.
Based on the distributed system architecture, the topological network structure of the links can be preset. Operation and maintenance personnel can construct a fault rule tree in advance according to the cause-and-effect relationships between the alarm types that may occur on each node device of the distributed system architecture.
It should be noted that the fault root cause analysis method provided by the embodiments of the present disclosure may generally be performed by the overall server 105. Accordingly, the fault root cause analysis device provided by the embodiments of the present disclosure may generally be disposed in the overall server 105.
Specifically, the network status of all node devices may be uploaded to the overall server 105 for use in fault root cause analysis. The overall server 105 periodically collects information from each node device in the distributed system architecture and determines from the collected information whether each node device has failed. For example, when the overall server 105 detects that the access terminals 101 and 102 and the user-side server 103 are down, corresponding alarm data a0, b0, and c0 are generated and uploaded to the overall server 105. The overall server 105 receives the alarm data and places it in an alarm list, which may also contain alarm data placed there earlier, such as alarm data b1 and b2. The overall server 105 must analyze these alarms to determine which one is the root cause alarm, and therefore retrieves every piece of alarm information from the alarm list to perform fault root cause analysis.
It should be understood that the numbers of access terminals, user-side servers, intermediate servers, and overall servers in FIG. 1 are merely illustrative. There may be any number of access terminals, user-side servers, intermediate servers, and overall servers, as required by the implementation.
The fault root cause analysis method of the embodiments of the present disclosure is described in detail below with reference to FIGS. 2 to 7, based on the system architecture described in FIG. 1.
Fig. 2 schematically illustrates a flow chart of a fault root cause analysis method according to an embodiment of the present disclosure.
As shown in FIG. 2, the fault root cause analysis method of this embodiment includes operations S210 to S270.
In operation S210, a link structure is obtained, wherein the link structure includes call links among a plurality of nodes, and at least one node in the link structure generates alarm information corresponding to a fault event.
In operation S220, a fault rule tree is constructed according to the link structure, wherein the fault rule tree represents the association relationship among the alarm information generated by the nodes in the link structure.
The fault rule tree can be formulated by operation and maintenance support personnel or experts according to the link structure. It reflects how alarms on other nodes are associated when the various types of alarms occur in the link structure.
The fault rule tree may also be built from data sets related to historical fault events by using an association analysis algorithm. For example, the association analysis algorithm may be one of the currently popular methods, such as the Apriori algorithm, the Maximal Frequent Itemset Approach, or the Galois Closure Based Approach.
In operation S230, a plurality of indexes to be detected of each node in a link structure are obtained.
In the embodiment of the present disclosure, at least one of the plurality of nodes runs in a container, and the data generated by the link structure may be transaction data. The indexes to be detected for each node are selected from fault indexes pre-stored in the node and may include, for example, at least one of container CPU usage, container memory usage, container I/O usage, network connection number, access time consumption, and access success rate.
In operation S240, first alarm information is obtained, where the first alarm information is sent by at least one first node in a link structure at a first time.
In operation S250, an alarm information set issued by a plurality of nodes in the link structure within a specified time window is obtained, where the specified time window is a specified time period before the first time, and the alarm information set includes a plurality of pieces of historical alarm information issued by the plurality of nodes within the specified time window.
The first alarm information indicates that the link has failed at the first time, that is, the time at which the alarm occurs, and the alarm information set consists of the pieces of historical alarm information collected in a specified time period before the first time. It will be appreciated that the specified time window may be set based on expert experience and serves as a reference time window for the alarm. For example, the specified time window may be the 5 minutes preceding the first time. In that case, at the first time, all alarm information raised on the link structure during the preceding 5 minutes is pulled to form the alarm information set, which is used to analyze the root cause node of the fault event at the first time.
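Pulling the alarm information set for the specified time window can be sketched as follows. This is an illustrative example only: the `Alarm` record, its field names, and the choice of a half-open window that includes its start and excludes the first time itself are assumptions, and the 5-minute default simply mirrors the example above.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Alarm(NamedTuple):
    node: str          # node that issued the alarm (hypothetical field)
    alarm_type: str    # preset alarm type (hypothetical field)
    time: datetime     # moment the alarm was issued


def alarms_in_window(alarms, first_time, window=timedelta(minutes=5)):
    """Return the alarms issued in the `window` preceding `first_time`.

    Assumption: the window is [first_time - window, first_time).
    """
    start = first_time - window
    return [a for a in alarms if start <= a.time < first_time]


first_time = datetime(2022, 1, 1, 12, 0, 0)
alarms = [
    Alarm("A", "timeout", first_time - timedelta(minutes=1)),
    Alarm("B", "timeout", first_time - timedelta(minutes=6)),  # too old
    Alarm("C", "cpu", first_time - timedelta(minutes=5)),
    Alarm("D", "timeout", first_time),  # the first alarm itself, not history
]
alarm_set = alarms_in_window(alarms, first_time)
```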
In an embodiment of the present disclosure, the alarm information set includes alarm information of different types. For ease of understanding, the alarm types are preset and may be set, for example, to a root cause alarm type and an object alarm type. In other embodiments, other alarm types may be set according to actual needs; the disclosure is not limited to these two types.
In operation S260, at least one node in the link structure is determined as a root cause node corresponding to the first alarm information according to the plurality of historical alarm information and the fault rule tree.
In operation S270, the deviation degree of each index to be detected of the root cause node is calculated by using a multivariate time series detection method, and the index to be detected with the highest deviation degree is taken as the root cause index of the root cause node.
In the embodiment of the disclosure, the multivariate time series detection method includes the Prophet algorithm, and the deviation degree of each index to be detected of the root cause node includes the degree to which the index deviates from its average value.
The Prophet algorithm is a time series algorithm suited to business behavior data with clear internal regularities, for example: historical data observed hourly, daily, or weekly over at least several months (preferably a year), and important holidays known in advance (e.g., national holidays) that occur at irregular intervals. Calculating the deviation degree of each index with the Prophet algorithm therefore gives high accuracy and timely detection.
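The idea of a deviation degree measured against the average value can be illustrated with a much simpler stand-in than Prophet. The sketch below does not use Prophet: it scores the latest sample of each index by its distance from the mean of the earlier samples, in units of their standard deviation (a z-score), and picks the index with the highest score. The function names and metric names are hypothetical, and in a Prophet-based variant the historical mean would presumably be replaced by the model's forecast.

```python
from statistics import mean, pstdev


def deviation_degree(series):
    """Deviation of the latest value from the mean of the earlier values,
    expressed in standard deviations. Needs at least two samples."""
    history, latest = series[:-1], series[-1]
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # A flat history: any change at all is an unbounded deviation.
        return 0.0 if latest == mu else float("inf")
    return abs(latest - mu) / sigma


def root_cause_index(metrics):
    """Return (name of the index with the highest deviation, all degrees)."""
    degrees = {name: deviation_degree(values) for name, values in metrics.items()}
    return max(degrees, key=degrees.get), degrees


metrics = {
    "container_cpu": [10, 12, 10, 12, 11],
    "access_time": [100, 102, 100, 102, 150],
}
name, degrees = root_cause_index(metrics)
```

Dividing by the standard deviation puts indexes with different units on a comparable scale, which is one simple way to realize the normalization step before comparing deviation degrees.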
With the embodiment of the disclosure, when an alarm occurs, the root cause node is located from the alarm information set collected over a reference time window on the link structure. Once the root cause node is found, multivariate time series detection is performed on each index to be detected defined for that node, so that the specific index responsible for the fault can be identified. The embodiment thus considers alarm rules and node index detection together and provides a simple and timely method of fault root cause localization.
In the embodiment of the present disclosure, the plurality of nodes include an upstream node and a downstream node, and the link structure includes a call link in which the upstream node calls the downstream node.
FIG. 3 schematically shows a flow diagram of fault rule tree construction according to an embodiment of the present disclosure.
As shown in fig. 3, the building of the fault rule tree according to the link structure in operation S220 may specifically include operations S2201 to S2202.
In operation S2201, upstream alarm information of an upstream node is acquired.
In operation S2202, the downstream node is caused to generate downstream alarm information according to the call link in which the upstream node calls the downstream node, where the downstream alarm information is of the same alarm type as the upstream alarm information.
For example, suppose the link is A -> B -> C -> D, with node D located most upstream and node A located most downstream. If node D sends first alarm information at the first moment, then, according to the way the fault rule tree is constructed, the first alarm information causes nodes C, B, and A all to generate alarm information of the same type as the first alarm information.
Therefore, the construction mode of the fault rule tree enables the alarm information of the upstream node to automatically trigger the downstream node to generate the alarm information of the same alarm type, so that the alarm information generated among different nodes in the link structure has a corresponding association relation.
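The propagation rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the link is written as a list ordered from most upstream to most downstream (so the example link is listed as D, C, B, A, following the statement that D is most upstream), the rule tree is modeled as a plain dictionary, and the names `build_fault_rule_tree`, `propagate_alarm`, and the alarm type `"timeout"` are hypothetical.

```python
from typing import Dict, List


def build_fault_rule_tree(upstream_first: List[str]) -> Dict[str, List[str]]:
    """Map each node to every node downstream of it (assumed rule-tree shape)."""
    return {
        node: upstream_first[i + 1:]
        for i, node in enumerate(upstream_first)
    }


def propagate_alarm(
    rule_tree: Dict[str, List[str]], source: str, alarm_type: str
) -> Dict[str, str]:
    """Return the alarm type raised on each node once `source` alarms."""
    alarms = {source: alarm_type}
    # Every downstream node receives an alarm of the same type as the source.
    for downstream in rule_tree[source]:
        alarms[downstream] = alarm_type
    return alarms


# Hypothetical link from the example, listed most upstream first.
chain = ["D", "C", "B", "A"]
rule_tree = build_fault_rule_tree(chain)
alarms = propagate_alarm(rule_tree, "D", "timeout")
```

Because the mapping is derived only from the link structure, it can be computed once in advance and reused for every alarm.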
Fig. 4 schematically illustrates a flow diagram of root cause node determination according to an embodiment of the present disclosure.
As shown in fig. 4, determining, in operation S260, at least one node in the link structure as the root cause node corresponding to the first alarm information according to the plurality of historical alarm information and the fault rule tree may specifically include operations S2601 to S2603.
In operation S2601, effective alarm information is filtered from the plurality of historical alarm information according to the fault rule tree, where there are multiple pieces of effective alarm information and each has the same alarm type as the first alarm information.
In operation S2602, a plurality of second nodes corresponding to the effective alarm information are obtained.
In operation S2603, the most upstream node of the plurality of second nodes is determined as the root cause node corresponding to the first alarm information.
It should be noted that the plurality of second nodes corresponding to the plurality of effective alarm information are contiguous in the link structure, and there are generally no isolated nodes separated by gaps.
Therefore, after the plurality of effective alarm information are obtained, the plurality of second nodes are obtained correspondingly, and the most upstream node among the second nodes is determined as the root cause node corresponding to the first alarm information sent at the first moment. Since there may be a plurality of first nodes at the first moment, there may correspondingly be a plurality of root cause nodes, each corresponding to the first alarm information sent by one first node at the first moment.
The root cause node corresponding to the first alarm information is updated iteratively from the alarm information set according to the fault rule tree, and generally tends toward the most upstream alarming node in the link structure.
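Operations S2601 to S2603 can be sketched as below. This is a simplified illustration, not the patented implementation: the fault rule tree is reduced to an upstream-first ordering of the nodes, effective alarms are taken to be those of the same type inside the time window, and the `Alarm` class, `find_root_cause_node`, the timestamps, and the choice to return `None` when no effective alarm exists are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Alarm:
    node: str
    alarm_type: str
    timestamp: float  # seconds; the unit is an assumption


def find_root_cause_node(
    first_alarm: Alarm,
    history: List[Alarm],
    upstream_first: List[str],
    window_seconds: float,
) -> Optional[str]:
    """Return the most upstream node that raised a same-type alarm in the window."""
    window_start = first_alarm.timestamp - window_seconds
    # S2601: keep historical alarms of the same type inside the time window.
    effective = [
        a for a in history
        if a.alarm_type == first_alarm.alarm_type
        and window_start <= a.timestamp < first_alarm.timestamp
    ]
    # S2602: collect the "second nodes" that raised those alarms.
    second_nodes = {a.node for a in effective}
    # S2603: pick the most upstream of them.
    for node in upstream_first:
        if node in second_nodes:
            return node
    return None  # assumption: no effective alarm means no root cause node


chain = ["D", "C", "B", "A"]  # most upstream first
history = [
    Alarm("D", "timeout", 100.0),
    Alarm("C", "timeout", 101.0),
    Alarm("C", "cpu_high", 101.5),  # different type, filtered out
    Alarm("B", "timeout", 102.0),
]
first = Alarm("A", "timeout", 103.0)
root = find_root_cause_node(first, history, chain, window_seconds=60.0)
```

With a 60-second window the sketch selects D. Shrinking the window to 2.5 seconds drops D's alarm and selects C, which shows how the window length affects the result.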
FIG. 5 schematically shows a flow chart of the degree of deviation calculation according to an embodiment of the present disclosure.
As shown in fig. 5, calculating the deviation degree of each index to be detected of the root cause node by using the multivariate time series detection method in operation S270 may specifically include operations S2701 to S2702.
In operation S2701, normalization processing is performed on each index to be detected of the root cause node.
In operation S2702, the deviation degree of each index to be detected is determined according to the standard deviation of the index to be detected after the normalization processing.
Therefore, each index to be detected of the root cause node can be made stable index data, and the ratio of the absolute value of the difference between each index to be detected and the standard deviation, to the standard deviation, is used as the deviation degree of that index.
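One possible reading of operations S2701 and S2702 is sketched below. The wording of the formula in the text is ambiguous, so this sketch makes explicit assumptions: normalization is min-max scaling, and the deviation degree is the absolute difference between the latest value and the historical mean, divided by the historical standard deviation (in line with the earlier statement that the deviation degree measures deviation from the average value). The function names and sample data are hypothetical, and the Prophet algorithm is not used here.

```python
from statistics import mean, pstdev
from typing import Dict, List


def min_max_normalize(values: List[float]) -> List[float]:
    """Scale a series into [0, 1]; a constant series maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def deviation_degree(values: List[float]) -> float:
    """Deviation of the latest sample from the mean of the earlier samples,
    in units of their standard deviation. Expects at least two samples."""
    normalized = min_max_normalize(values)
    history, latest = normalized[:-1], normalized[-1]
    baseline, sigma = mean(history), pstdev(history)
    if sigma == 0:
        return 0.0 if latest == baseline else float("inf")
    return abs(latest - baseline) / sigma


# Hypothetical samples for two indexes of a root cause node (latest value last).
metrics: Dict[str, List[float]] = {
    "cpu_usage": [50, 52, 48, 50, 51],
    "access_time_ms": [100, 102, 98, 100, 300],
}
deviations = {name: deviation_degree(series) for name, series in metrics.items()}
# The index with the highest deviation degree is taken as the root cause index.
root_cause_index = max(deviations, key=deviations.get)
```

In this data the access time jumps far outside its usual range while CPU usage stays close to its mean, so the access time is selected.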
Fig. 6 schematically shows a flow chart of root cause indicator determination for a root cause node according to an embodiment of the present disclosure.
As shown in fig. 6, to further improve the accuracy of the root cause index, when the plurality of indexes to be detected include the number of network connections and the access time consumption, determining the index to be detected corresponding to the highest deviation degree as the root cause index of the root cause node in operation S270 may specifically include operations S2703 to S2706.
In operation S2703, the deviation degrees of the number of network connections and the access time consumption of the root cause node are compared with their respective preset lower threshold values.
In operation S2704, if the deviation degrees of both the number of network connections and the access time consumption are greater than their lower threshold values, it is determined that the network is abnormal.
In operation S2705, whichever of the number of network connections and the access time consumption has the higher deviation degree is determined as the root cause index of the root cause node.
Therefore, when the number of network connections and the access time consumption indicate that the network is abnormal, the one of the two indexes with the higher deviation degree is used as the root cause index of the root cause node for the current fault event.
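The threshold rule in operations S2703 to S2705 can be illustrated with a short sketch. The index names, the threshold values, the tie-breaking choice, and the decision to return `None` when the network is not judged abnormal (leaving the caller to fall back to the general highest-deviation rule) are assumptions made for illustration only.

```python
from typing import Dict, Optional


def pick_network_root_cause(
    deviations: Dict[str, float], lower_thresholds: Dict[str, float]
) -> Optional[str]:
    """Apply the network rule; return None if the network is not abnormal."""
    conn = deviations["network_connections"]
    cost = deviations["access_time"]
    # S2703/S2704: both deviation degrees must exceed their own lower threshold.
    network_abnormal = (
        conn > lower_thresholds["network_connections"]
        and cost > lower_thresholds["access_time"]
    )
    if not network_abnormal:
        return None
    # S2705: the index with the higher deviation degree wins (ties go to
    # network_connections, an arbitrary choice made for this sketch).
    return "network_connections" if conn >= cost else "access_time"


thresholds = {"network_connections": 2.0, "access_time": 2.0}
result = pick_network_root_cause(
    {"network_connections": 3.5, "access_time": 6.1}, thresholds
)
```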
Fig. 7 schematically illustrates a flow chart of a fault root cause analysis method according to another embodiment of the present disclosure.
As shown in fig. 7, in another embodiment, after the root cause index of the root cause node is obtained in operation S270, operations S280 to S290 may further be included.
In operation S280, the indexes to be detected of the root cause node other than the root cause index are sorted according to deviation degree.
In operation S290, the other indexes to be detected, sorted according to deviation degree, are sent.
Therefore, the indexes to be detected on the root cause node other than the root cause index can be sorted according to deviation degree and then sent to developers or other relevant personnel.
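Operation S280 amounts to a filtered sort, sketched below. The text does not state the sort direction, so descending order (most deviant first) is an assumption, as are the function name and the sample values. The sending step of operation S290 is not shown.

```python
from typing import Dict, List, Tuple


def rank_other_indexes(
    deviations: Dict[str, float], root_cause_index: str
) -> List[Tuple[str, float]]:
    """Sort every index except the root cause index by deviation degree."""
    others = [
        (name, degree)
        for name, degree in deviations.items()
        if name != root_cause_index
    ]
    # Assumed order: highest deviation degree first.
    return sorted(others, key=lambda item: item[1], reverse=True)


deviations = {
    "access_time": 6.1,
    "cpu_usage": 0.7,
    "memory_usage": 2.4,
    "io_usage": 1.1,
}
ranked = rank_other_indexes(deviations, root_cause_index="access_time")
```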
According to the embodiment of the disclosure, node location is performed using the fault rule tree and the multivariate time series detection algorithm. Because the fault rule tree is built offline in advance, node location is fast and accurate. For index location, the multivariate time series detection algorithm is used to evaluate the index deviation degrees comprehensively, which gives high precision and timely detection. The fault root cause analysis method is therefore both precise and quick to respond.
Based on the fault root cause analysis method, the disclosure also provides a fault root cause analysis device. The apparatus will be described in detail below with reference to fig. 8.
Fig. 8 schematically shows a block diagram of a fault root cause analysis apparatus according to an embodiment of the present disclosure.
As shown in fig. 8, the fault root cause analysis apparatus 800 of this embodiment includes a link obtaining module 810, a rule tree constructing module 820, a to-be-detected index obtaining module 830, an alarm information obtaining module 840, a historical alarm information obtaining module 850, a root cause node determining module 860, and a root cause index determining module 870.
The link obtaining module 810 is configured to obtain a link structure, where the link structure includes a call link between multiple nodes, and at least one node in the link structure generates alarm information corresponding to a fault event. In an embodiment, the link obtaining module 810 may be configured to perform the operation S210 described above, which is not described herein again.
The rule tree constructing module 820 is configured to construct a fault rule tree according to the link structure, where the fault rule tree represents the association relationship among the nodes in the link structure in generating alarm information. In an embodiment, the rule tree constructing module 820 may be configured to perform the operation S220 described above, which is not described herein again.
The to-be-detected index obtaining module 830 is configured to obtain multiple to-be-detected indexes of each node in the link structure. In an embodiment, the to-be-detected index obtaining module 830 may be configured to perform the operation S230 described above, which is not described herein again.
The alarm information obtaining module 840 is configured to obtain first alarm information, where the first alarm information is sent by at least one first node in the link structure at a first time. In an embodiment, the alarm information obtaining module 840 may be configured to perform the operation S240 described above, which is not described herein again.
The historical alarm information obtaining module 850 is configured to obtain an alarm information set sent by a plurality of nodes in the link structure within a specified time window, where the specified time window is a specified time period before the first time, and the alarm information set includes a plurality of historical alarm information sent by the plurality of nodes within the specified time window. In an embodiment, the historical alarm information obtaining module 850 may be configured to perform the operation S250 described above, which is not described herein again.
The root cause node determining module 860 is configured to determine at least one node in the link structure as a root cause node corresponding to the first alarm information according to the plurality of historical alarm information and the fault rule tree. In an embodiment, the root cause node determining module 860 may be configured to perform the operation S260 described above, which is not described herein again.
The root cause index determining module 870 is configured to calculate a deviation degree of each to-be-detected index of the root cause node by using a multivariate time series detection method, and determine the to-be-detected index corresponding to the highest deviation degree as the root cause index of the root cause node. In an embodiment, the root cause index determining module 870 may be configured to perform the operation S270 described above, which is not described herein again.
According to the embodiment of the disclosure, when an alarm occurs, a root cause node is located from the alarm information set within a reference time window on the link. After the root cause node is found, multivariate time series detection is performed on each index to be detected defined on that node, so that the specific index causing the fault can be identified.
According to the embodiment of the present disclosure, any number of the link obtaining module 810, the rule tree constructing module 820, the to-be-detected index obtaining module 830, the alarm information obtaining module 840, the historical alarm information obtaining module 850, the root cause node determining module 860, and the root cause index determining module 870 may be combined and implemented in one module, or any one of them may be split into multiple modules. Alternatively, at least part of the functionality of one or more of these modules may be combined with at least part of the functionality of the other modules and implemented in one module. According to the embodiment of the present disclosure, at least one of the link obtaining module 810, the rule tree constructing module 820, the to-be-detected index obtaining module 830, the alarm information obtaining module 840, the historical alarm information obtaining module 850, the root cause node determining module 860, and the root cause index determining module 870 may be at least partially implemented as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, or an Application Specific Integrated Circuit (ASIC); may be implemented by hardware or firmware in any other reasonable manner of integrating or packaging a circuit; or may be implemented in any one of software, hardware, and firmware, or in a suitable combination of any of them.
Alternatively, at least one of the link obtaining module 810, the rule tree constructing module 820, the to-be-detected index obtaining module 830, the alarm information obtaining module 840, the historical alarm information obtaining module 850, the root cause node determining module 860, and the root cause index determining module 870 may be at least partially implemented as a computer program module, which may perform corresponding functions when executed.
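As a rough illustration of how such modules might be realized as computer program modules, the sketch below wires two of them together as injected callables. It is an assumption-laden outline rather than the apparatus itself: the class name `FaultRootCauseAnalyzer`, the callable signatures, and the stub behaviors standing in for modules 860 and 870 are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class FaultRootCauseAnalyzer:
    # Plays the role of the root cause node determining module 860.
    locate_root_cause_node: Callable[[str, List[Tuple[str, str]]], str]
    # Supplies the deviation degrees used by the root cause index determining module 870.
    compute_deviations: Callable[[str], Dict[str, float]]

    def analyze(
        self, alarm_type: str, history: List[Tuple[str, str]]
    ) -> Tuple[str, str]:
        """Return (root cause node, root cause index) for one alarm."""
        node = self.locate_root_cause_node(alarm_type, history)
        deviations = self.compute_deviations(node)
        # The index with the highest deviation degree is the root cause index.
        index = max(deviations, key=deviations.get)
        return node, index


# Stub callables used only to exercise the wiring.
analyzer = FaultRootCauseAnalyzer(
    locate_root_cause_node=lambda alarm_type, history: "D",
    compute_deviations=lambda node: {"cpu_usage": 0.2, "access_time": 3.0},
)
outcome = analyzer.analyze("timeout", [("D", "timeout"), ("C", "timeout")])
```

Injecting the modules as callables reflects the statement that modules may be combined or split freely: either role can be replaced without changing the other.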
Fig. 9 schematically illustrates a block diagram of an electronic device adapted to implement a fault root cause analysis method according to an embodiment of the present disclosure.
As shown in fig. 9, an electronic apparatus 900 according to an embodiment of the present disclosure includes a processor 901 which can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 902 or a program loaded from a storage portion 908 into a Random Access Memory (RAM) 903. The processor 901 may comprise, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or associated chipset, and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), among others. The processor 901 may also include on-board memory for caching purposes. The processor 901 may comprise a single processing unit or a plurality of processing units for performing the different actions of the method flows according to embodiments of the present disclosure.
In the RAM 903, various programs and data necessary for the operation of the electronic apparatus 900 are stored. The processor 901, the ROM 902, and the RAM 903 are connected to each other through a bus 904. The processor 901 performs various operations of the method flows according to the embodiments of the present disclosure by executing programs in the ROM 902 and/or the RAM 903. Note that the programs may also be stored in one or more memories other than the ROM 902 and the RAM 903. The processor 901 may also perform various operations of the method flows according to embodiments of the present disclosure by executing programs stored in the one or more memories.
According to an embodiment of the present disclosure, the electronic device 900 may also include an input/output (I/O) interface 905, which is also connected to the bus 904. The electronic device 900 may also include one or more of the following components connected to the I/O interface 905: an input portion 906 including a keyboard, a mouse, and the like; an output section 907 including components such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage portion 908 including a hard disk and the like; and a communication section 909 including a network interface card such as a LAN card, a modem, or the like. The communication section 909 performs communication processing via a network such as the internet. A drive 910 is also connected to the I/O interface 905 as necessary. A removable medium 911 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 910 as necessary, so that a computer program read out therefrom is installed into the storage portion 908 as necessary.
The present disclosure also provides a computer-readable storage medium, which may be contained in the apparatus/device/system described in the above embodiments; or may exist separately and not be assembled into the device/apparatus/system. The above-mentioned computer-readable storage medium carries one or more programs which, when executed, implement the fault root cause analysis method according to an embodiment of the present disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, a computer-readable storage medium may include the ROM 902 and/or the RAM 903 described above and/or one or more memories other than the ROM 902 and the RAM 903.
Embodiments of the present disclosure also include a computer program product comprising a computer program containing program code for performing the method illustrated in the flow chart. When the computer program product runs in a computer system, the program code is used for causing the computer system to realize the fault root cause analysis method provided by the embodiment of the disclosure.
The computer program performs the above-described functions defined in the system/apparatus of the embodiments of the present disclosure when executed by the processor 901. The systems, apparatuses, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the present disclosure.
In one embodiment, the computer program may be hosted on a tangible storage medium such as an optical storage device, a magnetic storage device, or the like. In another embodiment, the computer program may also be transmitted, distributed in the form of a signal on a network medium, and downloaded and installed through the communication section 909 and/or installed from the removable medium 911. The computer program containing program code may be transmitted using any suitable network medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 909, and/or installed from the removable medium 911. The computer program, when executed by the processor 901, performs the above-described functions defined in the system of the embodiment of the present disclosure. The systems, devices, apparatuses, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the present disclosure.
In accordance with embodiments of the present disclosure, program code for executing computer programs provided by embodiments of the present disclosure may be written in any combination of one or more programming languages, and in particular, these computer programs may be implemented using high level procedural and/or object oriented programming languages, and/or assembly/machine languages. The programming languages include, but are not limited to, Java, C++, Python, the "C" language, or the like. The program code may execute entirely on the user computing device, partly on the user device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that the features recited in the various embodiments and/or claims of the present disclosure can be combined and/or incorporated in various ways, even if such combinations or incorporations are not expressly recited in the present disclosure. In particular, various combinations and/or incorporations of the features recited in the various embodiments and/or claims of the present disclosure may be made without departing from the spirit or teaching of the present disclosure. All such combinations and/or incorporations are within the scope of the present disclosure.
The embodiments of the present disclosure have been described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described separately above, this does not mean that the measures in the embodiments cannot be used in advantageous combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be devised by those skilled in the art without departing from the scope of the present disclosure, and such alternatives and modifications are intended to be within the scope of the present disclosure.

Claims (11)

1. A fault root cause analysis method is characterized by comprising the following steps:
obtaining a link structure, wherein the link structure comprises a calling link among a plurality of nodes, and at least one node in the link structure generates alarm information corresponding to a fault event;
constructing a fault rule tree according to the link structure, wherein the fault rule tree represents the association relationship among the nodes in the link structure in generating alarm information;
acquiring a plurality of indexes to be detected of each node in the link structure;
acquiring first alarm information, wherein the first alarm information is sent by at least one first node in the link structure at a first moment;
acquiring an alarm information set sent by a plurality of nodes in the link structure within a specified time window, wherein the specified time window is a specified time period before the first time, and the alarm information set comprises a plurality of historical alarm information sent by the plurality of nodes within the specified time window;
determining at least one node in the link structure as a root cause node corresponding to the first alarm information according to the plurality of historical alarm information and the fault rule tree; and
calculating the deviation degree of each index to be detected of the root cause node by using a multivariate time series detection method, and determining the index to be detected corresponding to the highest deviation degree as the root cause index of the root cause node.
2. The method of claim 1, wherein the plurality of nodes includes an upstream node and a downstream node, the link structure including a call link where the upstream node calls the downstream node;
the constructing a fault rule tree according to the link structure specifically includes:
acquiring upstream alarm information of the upstream node; and
causing the downstream node to generate downstream alarm information according to the call link in which the upstream node calls the downstream node, wherein the downstream alarm information is of the same alarm type as the upstream alarm information.
3. The method according to claim 2, wherein the determining, according to the plurality of historical alarm information and the fault rule tree, at least one node in the link structure as a root cause node corresponding to the first alarm information specifically includes:
screening effective alarm information from the plurality of historical alarm information according to the fault rule tree, wherein there are multiple pieces of effective alarm information and each has the same alarm type as the first alarm information;
acquiring a plurality of second nodes corresponding to the effective alarm information; and
determining the most upstream node in the plurality of second nodes as the root cause node corresponding to the first alarm information.
4. The method according to claim 1, wherein the calculating the deviation degree of each index to be detected of the root cause node by using the multivariate time series detection method specifically comprises:
normalizing each index to be detected of the root cause node; and
determining the deviation degree of each index to be detected according to the standard deviation of the index to be detected after normalization processing.
5. The method according to claim 1, wherein at least one of the nodes operates in a container, and the plurality of indicators to be detected include at least one of container CPU usage, container memory usage, container I/O usage, network connection count, access time consumption, and access success rate.
6. The method according to claim 5, wherein when the plurality of indexes to be detected include the number of network connections and the access time consumption, the determining the index to be detected corresponding to the highest deviation degree as the root cause index of the root cause node specifically includes:
comparing the deviation degrees of the number of network connections and the access time consumption of the root cause node with their respective preset lower limit threshold values;
if the deviation degrees of the number of network connections and the access time consumption are both greater than their lower limit threshold values, judging that the network is abnormal; and
determining whichever of the number of network connections and the access time consumption has the higher deviation degree as the root cause index of the root cause node.
7. The method of claim 1, wherein the method further comprises:
sorting the indexes to be detected of the root cause node, other than the root cause index, according to the deviation degree; and
sending the other indexes to be detected after sorting according to the deviation degree.
8. A fault root cause analysis device, comprising:
a link obtaining module, configured to obtain a link structure, where the link structure includes a call link between multiple nodes, and at least one node in the link structure generates alarm information corresponding to a fault event;
a rule tree construction module, configured to construct a fault rule tree according to the link structure, where the fault rule tree represents the association relationship among the nodes in the link structure in generating alarm information;
a to-be-detected index obtaining module, configured to obtain a plurality of to-be-detected indexes of each node in the link structure;
an alarm information obtaining module, configured to obtain first alarm information, where the first alarm information is sent by at least one first node in the link structure at a first moment;
a historical alarm information obtaining module, configured to obtain an alarm information set sent by a plurality of nodes in the link structure within a specified time window, where the specified time window is a specified time period before the first time, and the alarm information set includes a plurality of pieces of historical alarm information sent by the plurality of nodes within the specified time window;
a root cause node determining module, configured to determine at least one node in the link structure as a root cause node corresponding to the first alarm information according to the plurality of historical alarm information and the fault rule tree; and
a root cause index determining module, configured to calculate the deviation degree of each index to be detected of the root cause node by using a multivariate time series detection method, and determine the index to be detected corresponding to the highest deviation degree as the root cause index of the root cause node.
9. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any of claims 1-7.
10. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method of any one of claims 1 to 7.
11. A computer program product comprising a computer program which, when executed by a processor, implements a method according to any one of claims 1 to 7.
CN202210131335.1A 2022-02-11 2022-02-11 Fault root cause analysis method, device, electronic equipment and medium Pending CN114461434A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210131335.1A CN114461434A (en) 2022-02-11 2022-02-11 Fault root cause analysis method, device, electronic equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210131335.1A CN114461434A (en) 2022-02-11 2022-02-11 Fault root cause analysis method, device, electronic equipment and medium

Publications (1)

Publication Number Publication Date
CN114461434A true CN114461434A (en) 2022-05-10

Family

ID=81413298

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210131335.1A Pending CN114461434A (en) 2022-02-11 2022-02-11 Fault root cause analysis method, device, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN114461434A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115102844A (en) * 2022-06-09 2022-09-23 摩拜(北京)信息技术有限公司 Fault monitoring and processing method and device and electronic equipment
CN115174353A (en) * 2022-07-14 2022-10-11 中国工商银行股份有限公司 Fault root cause determination method, device, equipment and medium
CN115174353B (en) * 2022-07-14 2024-04-16 中国工商银行股份有限公司 Fault root cause determining method, device, equipment and medium
CN116820826A (en) * 2023-08-28 2023-09-29 北京必示科技有限公司 Root cause positioning method, device, equipment and storage medium based on call chain
CN116820826B (en) * 2023-08-28 2023-11-24 北京必示科技有限公司 Root cause positioning method, device, equipment and storage medium based on call chain

Similar Documents

Publication Publication Date Title
CN114461434A (en) Fault root cause analysis method, device, electronic equipment and medium
US11640348B2 (en) Generating anomaly alerts for time series data
CN109471783B (en) Method and device for predicting task operation parameters
US20170270419A1 (en) Escalation prediction based on timed state machines
CN113704065A (en) Monitoring method, device, equipment and computer storage medium
CN107704387A (en) Method, apparatus, electronic equipment and computer-readable medium for system early warning
CN115174353B (en) Fault root cause determining method, device, equipment and medium
CN115422003A (en) Data quality monitoring method and device, electronic equipment and storage medium
CN114443437A (en) Alarm root cause output method, apparatus, device, medium, and program product
CN113420935A (en) Fault location method, apparatus, device and medium
CN113515399A (en) Data anomaly detection method and device
US20230342699A1 (en) Systems and methods for modeling and analysis of infrastructure services provided by cloud services provider systems
CN113495825A (en) Line alarm processing method and device, electronic equipment and readable storage medium
US11212162B2 (en) Bayesian-based event grouping
CN114746844A (en) Identification of constituent events in an event storm in operations management
CN113900905A (en) Log monitoring method and device, electronic equipment and storage medium
CN114816955A (en) Database performance prediction method and device
CN114861909A (en) Model quality monitoring method and device, electronic equipment and storage medium
CN114281586A (en) Fault determination method and device, electronic equipment and computer readable storage medium
CN114500318A (en) Batch operation monitoring method and device, equipment and medium
CN113052509A (en) Model evaluation method, model evaluation apparatus, electronic device, and storage medium
CN113242148A (en) Method, device, medium and electronic equipment for generating monitoring alarm related information
CN114710397B (en) Service link fault root cause positioning method and device, electronic equipment and medium
CN117130873B (en) Task monitoring method and device
CN114237856A (en) Operation type identification method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination