CN114896090A - Database fault diagnosis method based on causal relationship - Google Patents

Publication number
CN114896090A
Authority
CN
China
Prior art keywords
monitoring
meta
variable
fault
mvi
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210348192.XA
Other languages
Chinese (zh)
Other versions
CN114896090B (en)
Inventor
裴丹
李明杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202210348192.XA priority Critical patent/CN114896090B/en
Publication of CN114896090A publication Critical patent/CN114896090A/en
Application granted granted Critical
Publication of CN114896090B publication Critical patent/CN114896090B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/07: Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703: Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079: Root cause analysis, i.e. error or fault diagnosis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/30: Monitoring
    • G06F11/3003: Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/302: Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a software system

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Quality & Reliability (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a database fault diagnosis method based on causal relationship, which comprises the following steps: collecting monitoring data of monitoring indexes of a database in a preset time period, and constructing a causal relationship graph among the monitoring indexes; the monitoring data comprises fault data and fault-free data; constructing a regression model for the monitoring indexes by using fault-free data based on the causal graph; calculating a regression error of the fault data through a regression model; calculating each monitoring index through a preset calculation formula based on the regression error so as to sort the monitoring indexes to obtain a monitoring index arrangement sequence; and determining the fault position of the database according to the arrangement sequence of the monitoring indexes and based on the monitoring indexes. The invention can realize accurate fault location, can screen one or more most key monitoring indexes from a large number of monitoring indexes, and assists technical personnel to enable the system to recover to normal.

Description

Database fault diagnosis method based on causal relationship
Technical Field
The invention relates to the technical field of fault diagnosis and causal relationship construction, in particular to a database fault diagnosis method based on causal relationship.
Background
The monitoring indexes of a single database are interrelated and causally connected. As a result, when a fault occurs in the database, many monitoring indexes change at the same time, which interferes with the technicians' judgment. As databases become more complex, it becomes difficult for a single technician to understand every detail of the system, and monitoring is increasingly relied upon. However, the number of monitoring indexes is also increasing, so understanding the relationships among them is both increasingly important and increasingly difficult for technicians who maintain the system and build intelligent applications on the monitoring data.
Meanwhile, causal discovery focuses on finding causal relationships between variables from observational data. Causal discovery from observational data is an emerging research direction; although some theory has been developed, the graphs constructed by existing methods still differ considerably from the true causal relationships in practical problems. For example, constraint-based approaches rely on statistical conditional independence tests, yet prior work has shown that no conditional independence test is universally valid. Gradient-based methods, on the other hand, risk overfitting the observed data at the expense of the correctness of the causal relationships.
Two construction methods based on causal assumptions, Sage and MicroHECL, are designed for the characteristics of the database, but they consider only a few types of monitoring indexes and are not general. Existing methods lack a systematic description of the multivariate relationships among finer-grained monitoring data, such as machine CPU utilization, service load and service latency.
Furthermore, monitoring is an important component in big data and is used for revealing the running state of the system, so that technicians can deduce and solve problems when the system does unexpected behaviors. A monitoring index refers to a dimension in monitoring. For example, average response time, access amount, and access success rate are common monitoring indicators in online service systems such as search engines and online shopping.
The prior art includes the following approaches.
Manual positioning: the technician checks the monitoring data one item at a time.
Depth-first search: an anomaly detection technique is first applied to screen out abnormal monitoring indexes. A depth-first traversal is then performed along the abnormal monitoring indexes in the causal relationship graph formed among the monitoring indexes, and the monitoring indexes at which the traversal stops are marked as root causes.
Random walk: some prior art calculates the Pearson correlation coefficient, the partial correlation coefficient or similar measures between each monitoring index and the key service-level monitoring indexes, then derives a transition probability matrix from the causal relationship graph among the monitoring indexes, and finally applies a random walk to score and rank each monitoring index.
The prior art has the following disadvantages. Manual positioning is time-consuming and labor-intensive; faced with a growing volume of monitoring data and many monitoring indexes that change sharply at the same time, technicians cannot readily identify the actual problem in the database, and the system is restored only after a serious delay. The depth-first-search approach depends on the performance of the anomaly detection technique: false alarms in anomaly detection can cause the search to miss the monitoring indexes that point to the source of the fault. The working mechanism of random walk is currently unclear, and the calculation of the transition probability matrix lacks a principled basis; in practice, the performance of random-walk-based diagnosis varies widely across databases.
When a fault occurs, a plurality of monitoring indexes change simultaneously, and the judgment of technicians is interfered. With increasing complexity, it is difficult for a single technician to understand every detail in the system and there is an increasing reliance on monitoring. However, the number of monitoring metrics is also increasing. Therefore, when a fault occurs, how to screen the most critical monitoring index or indexes from a large number of monitoring indexes and assist technicians to enable the system to recover to be normal becomes a problem to be solved urgently.
Disclosure of Invention
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, the invention aims to provide a database fault diagnosis method based on causal relationship, which can realize accurate fault location by constructing the causal relationship of monitoring indexes, so as to screen one or more most critical monitoring indexes from a large number of monitoring indexes and assist technicians to enable a system to be recovered to be normal.
In order to achieve the above object, the present invention provides a database fault diagnosis method based on causal relationship, including:
collecting monitoring data of monitoring indexes of a database in a preset time period, and constructing a causal relationship graph among the monitoring indexes; wherein the monitoring data comprises fault data and non-fault data; constructing a regression model for the monitoring index by using the fault-free data based on the causal relationship graph; calculating a regression error of the fault data through the regression model; calculating each monitoring index through a preset calculation formula based on the regression error so as to sort the monitoring indexes to obtain a monitoring index arrangement sequence; and determining the fault position of the database according to the arrangement sequence of the monitoring indexes and based on the monitoring indexes.
According to the database fault diagnosis method based on the causal relationship, when a fault occurs, one or more most key monitoring indexes are screened from a large number of monitors to assist technicians to enable the system to be normal, and the method is simple to implement, convenient to operate and high in efficiency.
In addition, the database fault diagnosis method based on the causal relationship according to the above embodiment of the present invention may further have the following additional technical features:
further, the monitoring index includes: access conditions, connection number occupation conditions, memory occupation conditions, disk capacity occupation conditions, index use conditions, network traffic, node states, data query conditions and/or node log information.
Further, said calculating a regression error of said fault data by said regression model comprises: for the data preceding the fault, computing the mean $m_i$ and variance $s_i$ of the regression error of each monitoring index $V_i$; and, for the data collected during the fault, computing the regression error $e_{ij}$ of the monitoring index $V_i$, where $j$ indexes the data points in the fault period.
Further, the preset calculation formula is as follows:
$$z_i = \max_j \frac{|e_{ij} - m_i|}{s_i}$$
wherein $z_i$ is a statistic that measures whether each monitoring index $V_i$ characterizes a fault.
Further, if a monitoring index set is constructed by the monitoring indexes, constructing a causal relationship graph among the monitoring indexes includes: dividing the monitoring indexes into corresponding meta-variables of corresponding components of a database architecture, and obtaining a first mapping from the meta-variables to the monitoring index set and a second mapping from the ordered meta-variable pairs to the monitoring index set; constructing a meta-variable causal relationship graph based on the database architecture and causal relationship information among the multiple meta-variables; constructing a causal graph between the monitoring indicators based on the meta-variable causal graph, the first mapping, and the second mapping; and instantiating the causal graph among the monitoring indexes into the causal graph among the monitoring indexes of the component instances.
Further, the database architecture is represented as a call relationship graph Gc = <Vc, Ec>, and the causal relationship information among the meta-variables includes: the relationship among the meta-variables inside a component, AG = <Va, Ea>; and the mappings AP, AC, AD and AA, each from a meta-variable type to a set of meta-variable types. AP gives the cause meta-variables located at the caller component, AC gives the result meta-variables located at the caller component, AD gives the result meta-variables located at the caller components of every level, and AA gives the cause meta-variables located at the caller components of every level.
Further, constructing the meta-variable causal relationship graph Gm = <Vm, Em> based on Gc, AG, AP, AC, AD and AA includes:
for each component Ci in Vc and each meta-variable type Tx in Va, adding a meta-variable <Ci, Tx> to Vm; for each component Ci in Vc and each edge Tx → Ty in Ea, adding an edge <Ci, Tx> → <Ci, Ty> to Em; for each edge Ci → Cj in Ec, each meta-variable type Tx in Va and each meta-variable type Ty in AP(Tx), adding an edge <Ci, Ty> → <Cj, Tx> to Em; for each edge Ci → Cj in Ec, each meta-variable type Tx in Va and each meta-variable type Ty in AC(Tx), adding an edge <Cj, Tx> → <Ci, Ty> to Em; for each component Cj in Vc, each ancestor component Ci of Cj in Gc, each meta-variable type Tx in Va and each meta-variable type Ty in AD(Tx), adding an edge <Cj, Tx> → <Ci, Ty> to Em; for each component Cj in Vc, each ancestor component Ci of Cj in Gc, each meta-variable type Tx in Va and each meta-variable type Ty in AA(Tx), adding an edge <Ci, Ty> → <Cj, Tx> to Em.
Further, the causal relationship graph among the monitoring indexes is G = <V, E>. The meta-variable causal relationship graph Gm is traversed in topological order, starting from the meta-variable with the fewest cause variables, and the following steps are executed in sequence for each traversed meta-variable MVi: let R(MVi) = ∅ and let P(MVi) = ∅; perform the first condition processing on each monitoring index Vy in M(MVi); perform the second condition processing on each cause meta-variable MVj of MVi in Gm; for each monitoring index Vx in P(MVi) and each monitoring index Vy in R(MVi), add Vx → Vy to E; and if R(MVi) = ∅, let R(MVi) = P(MVi).
Further, performing the first condition processing includes: if RM(Vy) contains only the one meta-variable MVi, adding Vy to R(MVi) and to V; if RM(Vy) contains several meta-variables and at least one of them other than MVi has not yet been visited during the construction of the causal relationship graph among the monitoring indexes, performing no operation; if RM(Vy) contains several meta-variables and all of them other than MVi have already been visited during that construction, adding Vy to P(MVi) and to V, and, for each meta-variable MVj in RM(Vy) other than MVi and each monitoring index Vx in R(MVj), adding Vx → Vy to E. Performing the second condition processing includes: if IV(MVj, MVi) = ∅, adding each monitoring index Vx in R(MVj) to P(MVi); if IV(MVj, MVi) ≠ ∅, then for each monitoring index Vx in R(MVj) and each monitoring index Vy in IV(MVj, MVi), adding Vy to P(MVi) and to V, and adding Vx → Vy to E.
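The traversal with its first and second condition processing can be sketched as below. This is an illustrative reading, not a definitive implementation: it assumes that R(MVi) and P(MVi) start empty, that the two branches of the second condition processing test whether IV(MVj, MVi) is empty, and that Gm is acyclic so a topological order exists. The function name and the use of `graphlib` (Python 3.9+) are choices made for this sketch.

```python
from graphlib import TopologicalSorter


def fill_monitoring_variables(Vm, Em, M, IV):
    """Build G = (V, E) among monitoring indexes from Gm = (Vm, Em), M and IV."""
    # RM: monitoring index -> set of meta-variables it belongs to (inverse of M).
    RM = {}
    for mv, indexes in M.items():
        for index in indexes:
            RM.setdefault(index, set()).add(mv)

    # causes[mv] holds the cause meta-variables (predecessors) of mv in Gm.
    causes = {mv: set() for mv in Vm}
    for src, dst in Em:
        causes[dst].add(src)

    V, E, R, P = set(), set(), {}, {}
    visited = set()
    for mvi in TopologicalSorter(causes).static_order():
        R[mvi], P[mvi] = set(), set()  # assumed initialisation
        # First condition processing.
        for vy in M.get(mvi, ()):
            others = RM[vy] - {mvi}
            if not others:
                R[mvi].add(vy)
                V.add(vy)
            elif others <= visited:
                P[mvi].add(vy)
                V.add(vy)
                for mvj in others:
                    E.update((vx, vy) for vx in R[mvj])
            # Otherwise another owner of vy is still unvisited: do nothing.
        # Second condition processing.
        for mvj in causes[mvi]:
            intermediate = IV.get((mvj, mvi), set())
            if not intermediate:
                P[mvi] |= R[mvj]
            else:
                for vy in intermediate:
                    P[mvi].add(vy)
                    V.add(vy)
                    E.update((vx, vy) for vx in R[mvj])
        # Connect inherited causes to the meta-variable's own indexes.
        E.update((vx, vy) for vx in P[mvi] for vy in R[mvi])
        if not R[mvi]:
            R[mvi] = set(P[mvi])
        visited.add(mvi)
    return V, E
```

In this sketch a meta-variable without monitoring indexes simply forwards the indexes of its causes, so an edge such as `a → c` still appears when the meta-variable between them is unmonitored.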
Further, the causal relationship graph G among the monitoring indexes is instantiated as a causal relationship graph Gi among the monitoring indexes of the component instances. For each edge <Ci, x> → <Cj, y> in E, where x and y are monitoring index names, two cases are handled: if Ci = Cj, then for each instance u of component Ci, the edge <Ci, u, x> → <Ci, u, y> is added to Gi; if Ci ≠ Cj, then for each instance u of component Ci and each instance v of component Cj, the edge <Ci, u, x> → <Cj, v, y> is added to Gi.
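The two instantiation cases can be illustrated with a short Python sketch. The function name, the `instances` dictionary and the tuple encoding of nodes are assumptions made for illustration only.

```python
from itertools import product


def instantiate(E, instances):
    """Expand component-level edges into edges between component instances.

    E holds edges ((Ci, x), (Cj, y)); instances maps a component to its instances.
    """
    Gi = set()
    for (ci, x), (cj, y) in E:
        if ci == cj:
            # Same component: connect the two indexes within each instance.
            for u in instances[ci]:
                Gi.add(((ci, u, x), (ci, u, y)))
        else:
            # Different components: connect every pair of instances.
            for u, v in product(instances[ci], instances[cj]):
                Gi.add(((ci, u, x), (cj, v, y)))
    return Gi
```

With two Server instances and one Storage instance, an intra-Server edge yields two instance-level edges, and a Server-to-Storage edge also yields two.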
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flow chart of a causal relationship based database fault diagnosis method according to an embodiment of the present invention;
FIG. 2 is a schematic overall flow chart for constructing a monitoring index causal relationship according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating an example of call relationship between SQL processing logic and memory architecture inside an Oracle database according to an embodiment of the present invention.
Detailed Description
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings.
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
In the present invention, unless otherwise expressly specified or limited, the terms "mounted," "connected," "secured," and the like are to be construed broadly and can, for example, be fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
In the present invention, unless otherwise expressly stated or limited, a first feature being "above" or "below" a second feature may mean that the first and second features are in direct contact, or that the first and second features are not in direct contact but are in contact through another feature between them. Also, a first feature being "on," "above" and "over" a second feature includes the first feature being directly above or obliquely above the second feature, or merely indicates that the first feature is at a higher level than the second feature. A first feature being "under," "below," and "beneath" a second feature includes the first feature being directly below or obliquely below the second feature, or simply means that the first feature is at a lower level than the second feature.
A causal relationship-based database fault diagnosis method proposed according to an embodiment of the present invention is described below with reference to the accompanying drawings.
FIG. 1 is a flow chart of a causal relationship based database fault diagnosis method of one embodiment of the present invention.
As shown in fig. 1, the method includes, but is not limited to, the following steps:
step S1, collecting monitoring data of monitoring indexes in a preset time period of a database, and constructing a causal relationship graph among the monitoring indexes; wherein the monitoring data comprises fault data and non-fault data.
Specifically, the operation status of the database, including the problem to be optimized and/or the fault problem, is embodied in the specific status index feature information, and in order to implement the automatic optimization processing of the database, in the embodiment of the present invention, some status indexes are required to be preset as monitoring objects, and the monitoring indexes of the corresponding database are acquired by monitoring the status indexes. Wherein the monitoring is continuously performed so as to find out the problems of the database cluster in time. These preset status indicators include, but are not limited to, the following: access conditions, connection number occupation conditions, memory occupation conditions, disk capacity occupation conditions, index use conditions, network traffic, node states, data query conditions and/or node log information.
In this embodiment, monitoring data corresponding to the preset time period is collected from the state information of the monitored database and stored in the database, so that a causal graph can be constructed for the monitoring data. The monitoring data includes failure data and non-failure data.
It will be appreciated that the invention maps a fault of the database to an intervention in causal inference theory and, following the principle of Occam's razor, assumes that the intervention that changes the monitoring indexes is the fault to be considered in the database. For the causal relationship graph G = <V, E> among the monitoring indexes, V is the set of all monitoring indexes, E is the set of edges (causal relationships) among them, and the set of parent nodes (cause variables) of any monitoring index Vi in G is Pa(Vi). According to causal inference theory, the following criterion is obtained:
"a monitoring index Vi is part of the intervention" is equivalent to "P(Vi | Pa(Vi) = pa(Vi), do(Vi)) ≠ P(Vi | Pa(Vi) = pa(Vi))";
wherein "do" is the intervention operator (from the do-calculus), the mathematical symbol that represents "intervention" in causal inference, and pa(Vi) denotes a given value of the cause variables Pa(Vi);
P(Vi | Pa(Vi) = pa(Vi), do(Vi)) on the left of the inequality is the interventional distribution, that is, the distribution of the monitoring index Vi after the fault occurs when its cause variables take the given value;
P(Vi | Pa(Vi) = pa(Vi)) on the right of the inequality is the distribution without intervention, that is, the distribution of the monitoring index Vi before the fault occurs when its cause variables take the same given value.
Further, to make the relationship construction easier to follow, this embodiment refers to the monitoring indexes as monitoring variables. The database architecture can be expressed as a call relationship graph Gc = <Vc, Ec>. Taking an Oracle database as an example, each SQL request, after being received by the database server (Server), is processed by three components of the architecture: parse (Parse), hard parse (HardParse) and execute (Execute). In this example, Vc = {Server, Parse, HardParse, Execute} is the set of components and Ec = {Server → Parse, Server → HardParse, Server → Execute} is the set of call relationships among the components. The call relationship graph may have different granularities; for example, the memory structure of the Oracle database can be subdivided into Buffer Cache, Redo Log Buffer and so on, and these components in turn occupy memory resources (that is, each of them "calls" the memory). FIG. 3 shows an example of the call relationships among the SQL processing logic and the memory architecture inside an Oracle database.
For the monitoring indexes of each component in the database, the invention divides them into four types of meta-variables: input distribution (I, Input), output distribution (O, Output), response time distribution (L, Latency) and resource utilization distribution (S, Saturation). A monitoring index that cannot be assigned to the first three types is regarded as describing some abstract resource in the system and is assigned to the fourth type. A meta-variable (MV) is a pair <component, meta-variable type>; for example, the meta-variable <Server, L> describes the response time distribution of the database as a whole, and the "SQL time per second" of the database can be assigned to it.
For the characteristics of the database, the invention introduces the following five sets of causal assumptions among the meta-variables.
1) The relationship AG (Assumed Graph) among the meta-variables inside a component. AG = <Va, Ea>, where Va = {I, O, L, S} is the set of all meta-variable types and Ea = {I → O, I → L, I → S, L → O, S → O, S → L} is the set of relationships among the meta-variables inside a component. The input (I) of the database is treated as a cause of the other meta-variables, the output (O) is treated as a result of the other meta-variables, and the resource utilization (S) is treated as a constraint on the response time (L).
2) The set of cause meta-variables AP (Assumed Parents) located at the caller component. AP is a mapping from a meta-variable type to a set of meta-variable types. Specifically, AP(I) = {I}; for example, in the Oracle database described above, the input of HardParse is determined by the input of the Server. Meanwhile, AP(O) = AP(L) = AP(S) = ∅, i.e. the output, response time and resource utilization of a component are not directly affected by its caller.
3) The set of result meta-variables AC (Assumed Children) located at the caller component. AC is a mapping from a meta-variable type to a set of meta-variable types. For example, in the Oracle database, the output and response time of the Server are influenced by components such as Parse, HardParse and Execute. Meanwhile, AC(I) = AC(S) = ∅, that is, the input and resource utilization of a component cannot directly affect its caller.
4) The set of result meta-variables AD (Assumed Descendants) located at the caller components of every level. AD is a mapping from a meta-variable type to a set of meta-variable types. Specifically, AD(O) = {O}; for example, in the aforementioned Oracle database, the output of the Server is affected by the output of HardParse, and in turn by the output of the Data Dictionary Cache. AD(O) differs from AC(O) in that the monitoring of the HardParse output distribution may miss part of the information in the Data Dictionary Cache output distribution; AD(O) makes this dependency explicit based on the characteristics of the database. Meanwhile, AD(I) = AD(L) = AD(S) = ∅, i.e. the cascading effect of the other meta-variable types is not considered.
5) The set of cause meta-variables AA (Assumed Ancestors) located at the caller components of every level. AA is a mapping from a meta-variable type to a set of meta-variable types. In the present invention, AA(I) = AA(O) = AA(L) = AA(S) = ∅; AA is nevertheless defined explicitly for completeness of the design.
The causal assumptions above enrich the prior art discussion about the type of monitoring indicators and the relationship between the monitoring indicators. Based on the database architecture information and the causal hypothesis among the meta-variables, the database architecture is firstly refined into the causal relationship among the meta-variables, and further refined into the causal relationship among the monitoring variables. Table 1 summarizes the symbols used in the present invention and their meanings.
TABLE 1
Figure BDA0003577814040000076
Figure BDA0003577814040000081
As an example, as shown in fig. 2:
First, the monitoring variables are divided: the technician assigns the monitoring variables to the corresponding meta-variables of the corresponding database components, which yields the following two kinds of information:
Step 1-1: A mapping M from meta-variables to sets of monitoring variables, i.e. the first mapping. For example, M(<Server, L>) = {<Server, SQL time per second>} indicates that the "SQL time per second" monitoring variable of the Server belongs to the meta-variable <Server, L>. One monitoring variable may correspond to the meta-variables of several components; for example, "logical read per execution" is a single monitoring variable computed from the input distributions of both the Server and the Buffer Cache components.
Step 1-2: A mapping IV (Intermediate Variables) from ordered pairs of meta-variables to sets of monitoring variables, i.e. the second mapping. For example, IV(<Buffer Cache, T>, <Storage, T>) = {<Oracle, db file parallel read>} indicates that the "db file parallel read" monitoring variable of the Oracle database is an intermediate variable through which <Buffer Cache, T> affects the meta-variable <Storage, T>; that is, the influence of <Buffer Cache, T> on <Storage, T> is carried by "db file parallel read", and <Buffer Cache, T> is not a direct cause of <Storage, T>.
Further, the meta-variable causal relationships are constructed:
A causal relationship graph Gm = <Vm, Em> among the meta-variables is constructed based on Gc, AG, AP, AC, AD and AA. This step is subdivided as follows:
Step 2-1: For each component Ci in Vc and each meta-variable type Tx in Va, a meta-variable <Ci, Tx> is added to Vm.
Step 2-2: For each component Ci in Vc and each edge Tx → Ty in Ea, an edge <Ci, Tx> → <Ci, Ty> is added to Em.
Step 2-3: For each edge Ci → Cj in Ec, each meta-variable type Tx in Va and each meta-variable type Ty in AP(Tx), an edge <Ci, Ty> → <Cj, Tx> is added to Em.
Step 2-4: For each edge Ci → Cj in Ec, each meta-variable type Tx in Va and each meta-variable type Ty in AC(Tx), an edge <Cj, Tx> → <Ci, Ty> is added to Em.
Step 2-5: For each component Cj in Vc, each ancestor component Ci of Cj in Gc, each meta-variable type Tx in Va and each meta-variable type Ty in AD(Tx), an edge <Cj, Tx> → <Ci, Ty> is added to Em.
"Ci is an ancestor component of Cj in Gc" means either that Ci → Cj is an element of Ec, or that there exist components Cx, Cy, ..., Cz such that Ci → Cx, Cx → Cy, ..., Cz → Cj are all elements of Ec.
Step 2-6: For each component Cj in Vc, each ancestor component Ci of Cj in Gc, each meta-variable type Tx in Va and each meta-variable type Ty in AA(Tx), an edge <Ci, Ty> → <Cj, Tx> is added to Em.
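Steps 2-1 to 2-6 can be sketched in Python as follows. The sketch encodes a meta-variable as a `(component, type)` tuple and an edge as a pair of such tuples; the function names are hypothetical, and the mappings AP, AC, AD and AA are passed as dictionaries in which a missing key stands for the empty set. Any concrete values supplied for these mappings beyond those stated in the text are the caller's assumption.

```python
from itertools import product


def ancestors(component, call_edges):
    """Components that reach `component` through one or more call edges."""
    callers = {}
    for caller, callee in call_edges:
        callers.setdefault(callee, set()).add(caller)
    found, stack = set(), [component]
    while stack:
        for caller in callers.get(stack.pop(), ()):
            if caller not in found:
                found.add(caller)
                stack.append(caller)
    return found


def build_meta_graph(Vc, Ec, Va, Ea, AP, AC, AD, AA):
    """Return (Vm, Em) for the meta-variable causal relationship graph Gm."""
    Vm = set(product(Vc, Va))                          # step 2-1
    Em = set()
    for c in Vc:                                       # step 2-2
        for tx, ty in Ea:
            Em.add(((c, tx), (c, ty)))
    for (ci, cj), tx in product(Ec, Va):
        for ty in AP.get(tx, ()):                      # step 2-3
            Em.add(((ci, ty), (cj, tx)))
        for ty in AC.get(tx, ()):                      # step 2-4
            Em.add(((cj, tx), (ci, ty)))
    for cj in Vc:
        for ci, tx in product(ancestors(cj, Ec), Va):
            for ty in AD.get(tx, ()):                  # step 2-5
                Em.add(((cj, tx), (ci, ty)))
            for ty in AA.get(tx, ()):                  # step 2-6
                Em.add(((ci, ty), (cj, tx)))
    return Vm, Em
```

For the Oracle example, calling `build_meta_graph` with `AP = {"I": {"I"}}` and `AD = {"O": {"O"}}` produces, among others, the edges `<Server, I> → <Parse, I>` and `<Parse, O> → <Server, O>`.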
Further, the monitoring variables are filled in:
A causal relationship graph G = <V, E> among the monitoring variables is constructed based on Gm, M and IV. The design of this step takes two factors into account:
A meta-variable may have no corresponding monitoring variable. To handle this, let R be a mapping from meta-variables to sets of monitoring variables. For any meta-variable MVi, R(MVi) denotes the monitoring variables through which MVi acts on its result meta-variables; when MVi has no corresponding monitoring variable, R(MVi) inherits the monitoring variables from the cause meta-variables of MVi in Gm.
One monitoring variable may correspond to several meta-variables. To handle this, let RM be a mapping from monitoring variables to sets of meta-variables. RM is the inverse of M: for any monitoring variable Vi, RM(Vi) is the set of all meta-variables to which Vi corresponds.
Gm is traversed in topological order and starting with the metavariable with the fewest causal variables. For each element variable MVi traversed in sequence, the following steps are sequentially executed:
Step 3-1: Let
Figure BDA0003577814040000091
Let
Figure BDA0003577814040000092
Step 3-2: For each monitoring variable Vy in M(MVi), the following three cases are handled:
Step 3-2-1: If RM(Vy) contains only the one meta-variable MVi, add Vy to R(MVi) and to V.
Step 3-2-2: If RM(Vy) contains multiple meta-variables, and some meta-variable in RM(Vy) other than MVi has not yet been visited in the topological traversal of Gm, no operation is performed. This avoids introducing a self-loop when one monitoring variable corresponds to multiple meta-variables.
Step 3-2-3: If RM(Vy) contains multiple meta-variables, and all meta-variables in RM(Vy) other than MVi have already been visited in the traversal, then:
add Vy to P(MVi) and to V;
for each meta-variable MVj in RM(Vy) other than MVi, and each monitoring variable Vx in R(MVj), add Vx → Vy to E.
Step 3-3: For each cause meta-variable MVj of MVi in Gm, the following two cases are handled:
Step 3-3-1: If
Figure BDA0003577814040000101
for each monitoring variable Vx in R(MVj), add Vx to P(MVi).
Step 3-3-2: If
Figure BDA0003577814040000102
for each monitoring variable Vx in R(MVj) and each Vy in IV(MVj, MVi), add Vy to P(MVi) and to V, and add Vx → Vy to E.
Step 3-4: For each monitoring variable Vx in P(MVi) and each Vy in R(MVi), add Vx → Vy to E.
Step 3-5: If
Figure BDA0003577814040000103
let R(MVi) = P(MVi).
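The monitoring variable filling procedure can be sketched as below. The formulas shown only as figures are not reproduced in the text, so the sketch rests on explicit assumptions: step 3-1 is taken to initialize R(MVi) and P(MVi) to empty sets, steps 3-3-1 and 3-3-2 are taken to test whether IV(MVj, MVi) is empty or non-empty, and step 3-5 is taken to test whether R(MVi) is empty. The function name `fill_monitoring_variables` and the dictionary-based inputs are hypothetical, and Gm is assumed to be acyclic.

```python
from collections import defaultdict
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def fill_monitoring_variables(causes, M, IV):
    """causes: meta-variable -> set of its cause meta-variables in Gm.
    M: meta-variable -> set of monitoring variables.
    IV: (cause, result) meta-variable pair -> set of monitoring variables.
    Returns (V, E) of the monitoring variable causal graph G."""
    RM = defaultdict(set)                          # inverse mapping of M
    for mv, variables in M.items():
        for v in variables:
            RM[v].add(mv)
    V, E, R, P, visited = set(), set(), {}, {}, set()
    # static_order() yields every cause before its results.
    for mvi in TopologicalSorter(causes).static_order():
        R[mvi], P[mvi] = set(), set()              # step 3-1 (assumed)
        for vy in M.get(mvi, ()):                  # step 3-2
            others = RM[vy] - {mvi}
            if not others:                         # step 3-2-1
                R[mvi].add(vy)
                V.add(vy)
            elif others <= visited:                # step 3-2-3
                P[mvi].add(vy)
                V.add(vy)
                for mvj in others:
                    E.update((vx, vy) for vx in R[mvj])
            # step 3-2-2: some other owner is unvisited, so do nothing
        for mvj in causes.get(mvi, ()):            # step 3-3
            between = IV.get((mvj, mvi), set())
            if not between:                        # step 3-3-1 (assumed test)
                P[mvi] |= R[mvj]
            else:                                  # step 3-3-2 (assumed test)
                for vy in between:
                    P[mvi].add(vy)
                    V.add(vy)
                    E.update((vx, vy) for vx in R[mvj])
        E.update((vx, vy) for vx in P[mvi] for vy in R[mvi])   # step 3-4
        if not R[mvi]:                             # step 3-5 (assumed test)
            R[mvi] = set(P[mvi])
        visited.add(mvi)
    return V, E
```

For a chain A → B → C in which B has no monitoring variable of its own, B inherits the variables of A, so the variable of A becomes a direct cause of the variable of C in G.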
Further, component instantiation:
Multiple instances of the same component may exist in the database. For example, the database may run in master-slave mode, with the component DB corresponding to two instances, DB1 and DB2; the aforementioned Oracle database may employ multiple storage devices. The obtained G is instantiated into a causal graph Gi among the monitoring variables of each component instance. Each edge <Ci, x> → <Cj, y> in E, where x and y are monitoring variable names, is handled in one of two cases:
Step 4-1: If Ci = Cj, then for each instance u of component Ci, add <Ci, u, x> → <Ci, u, y> to the graph Gi.
Step 4-2: If Ci ≠ Cj, then for each instance u of component Ci and each instance v of component Cj, add <Ci, u, x> → <Cj, v, y> to the graph Gi.
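Steps 4-1 and 4-2 can be illustrated with a short sketch. The function name `instantiate`, the tuple encoding of nodes, and the component and variable names used in the usage example are hypothetical and chosen only for illustration.

```python
def instantiate(E, instances):
    """E: set of ((Ci, x), (Cj, y)) edges between component-level variables.
    instances: component -> iterable of instance identifiers.
    Returns the edge set of Gi, with nodes encoded as (component, instance, name)."""
    Gi = set()
    for (ci, x), (cj, y) in E:
        if ci == cj:
            # Step 4-1: same component, so connect variables within each instance.
            for u in instances[ci]:
                Gi.add(((ci, u, x), (ci, u, y)))
        else:
            # Step 4-2: different components, so connect every instance pair.
            for u in instances[ci]:
                for v in instances[cj]:
                    Gi.add(((ci, u, x), (cj, v, y)))
    return Gi


# Usage with hypothetical names: two DB instances and three storage devices.
edges = {(("DB", "cpu"), ("DB", "latency")), (("DB", "io"), ("Storage", "busy"))}
graph = instantiate(edges, {"DB": [1, 2], "Storage": [1, 2, 3]})
```

An intra-component edge yields one edge per instance, while a cross-component edge yields one edge per pair of instances.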
As an implementation, the causal assumptions, the monitoring variable division step, and the meta-variable causal relationship construction step of the invention are designed for databases. Causal assumptions and meta-variable causal relationship construction methods specific to other technical fields are not included in the present invention. However, if another technical field also adopts the technical route of first constructing the causal relationships among meta-variables and then constructing the monitoring variable causal relationship graph by monitoring variable filling, the monitoring variable filling and component instantiation steps of the invention remain applicable. The specific values of the five sets of causal assumptions AG, AP, AC, AD, and AA, as key parameters, should not be construed as limitations of the present invention. Within the design framework of the present invention, a monitoring variable causal relationship graph can still be constructed through the steps of the invention after the causal assumptions are replaced.
Step S2: based on the causal graph, a regression model is constructed for the monitoring indexes using the fault-free data.
Specifically, for each monitoring variable Vi, a regression model of Vi on Pa(Vi) is constructed using the fault-free data.
The embodiment of the invention does not specifically limit the design of the regression model. One skilled in the art can select an existing regression technique according to the characteristics of the system, such as linear regression, support vector regression (SVR) with a nonlinear kernel function, or recurrent neural networks (RNN) suited to time series with obvious autoregressive characteristics.
Further, the regression model may be constructed offline from historical data collected before the fault occurs, or constructed online, when a fault occurs, from data in a period shortly before the fault. The former has a large data volume and ample time, so a complex regression model can be selected; the latter uses data close to the latest state of the system, but a lightweight regression model must be selected because fault diagnosis is time-critical.
Step S3: the regression error of the fault data is calculated through the regression model.
Specifically, after a fault occurs, the constructed regression model is applied separately to data from a period during the fault (e.g., the 10 minutes before the fault is detected) and to data from a period before the fault (e.g., the 2 hours before the detected fault began, not overlapping the former period). For each monitoring variable Vi, the regression error of Vi at each data point in the two periods is calculated:
For the pre-fault data, the mean $m_i$ and variance $s_i$ of the regression error of Vi are computed.
For the data during the fault, the regression error of Vi at data point j is recorded as $e_{ij}$.
Step S4: based on the regression error, a score is calculated for each monitoring index through a preset calculation formula, and the monitoring indexes are sorted by this score to obtain a monitoring index ranking.
Specifically, after a fault occurs, $z_i = \max_j |e_{ij} - m_i| / s_i$ is calculated for each monitoring variable Vi. The formula puts the regression errors of different monitoring variables on a common standard-deviation scale so that they can be compared. The monitoring indexes are then sorted by $z_i$ to obtain the ranking.
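The scoring and sorting can be sketched as follows. Two points are assumptions rather than statements of the source: $s_i$ is computed as the population standard deviation of the pre-fault regression errors (the text calls it a variance but describes a standard-deviation normalization), and variables are ranked in descending order of $z_i$. The function names and the small floor that guards against a zero deviation are also hypothetical.

```python
from statistics import mean, pstdev


def anomaly_scores(pre_fault_errors, fault_errors):
    """Both arguments map a variable name to a list of regression errors.
    Returns variable -> z_i = max_j |e_ij - m_i| / s_i."""
    scores = {}
    for name, errors in fault_errors.items():
        m = mean(pre_fault_errors[name])
        # Assumption: s_i is a standard deviation; the floor avoids division by zero.
        s = max(pstdev(pre_fault_errors[name]), 1e-9)
        scores[name] = max(abs(e - m) for e in errors) / s
    return scores


def rank_variables(scores):
    """Return variable names sorted from highest to lowest score (assumed order)."""
    return sorted(scores, key=scores.get, reverse=True)


# Usage with hypothetical variables "a" and "b".
scores = anomaly_scores(
    {"a": [0, 2, 0, 2], "b": [0, 0, 4, 4]},
    {"a": [4, -1], "b": [4, 6]},
)
ranking = rank_variables(scores)
```

Taking the maximum over the fault window means that a single strongly deviating data point is enough to move a variable up the ranking.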
Step S5: the fault location of the database is determined based on the monitoring indexes, according to the monitoring index ranking.
Specifically, after a fault occurs, the monitoring variables are sorted by the $z_i$ calculated above. By checking the monitoring variables in the order recommended by the invention, technicians can determine the fault location of the database as early as possible.
The beneficial effect of the embodiment of the invention can be measured by the recall that considers only the monitoring variables ranked in the top k, denoted AC@k. Table 2 compares the present invention with the prior art on monitoring data and fault cases collected from an Oracle database. Compared with the following 3 schemes, the invention significantly improves fault diagnosis performance:
Depth-first search, with a threshold set on the Pearson correlation coefficient between the monitoring variables and the service index during traversal.
Random walk, with the transition probability matrix calculated from the partial correlation coefficients between the monitoring variables and the service index.
ENMF, which constructs an ARX model between every pair of monitoring variables to model the fault propagation process.
TABLE 2 Effect of different fault diagnosis techniques
Fault diagnosis technique | AC@1  | AC@3  | AC@5
Depth-first search        | 0.278 | 0.419 | 0.449
Random walk               | 0.101 | 0.338 | 0.475
ENMF                      | 0.126 | 0.293 | 0.389
The invention             | 0.328 | 0.601 | 0.677
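The AC@k metric reported in Table 2 is described only as the recall over the top-k ranked monitoring variables. The sketch below shows one plausible reading, assumed here rather than taken from the source: for each fault case, the fraction of true root-cause variables found in the top k, averaged over all cases. The function name `ac_at_k` and the sample cases are hypothetical and do not reproduce the table's figures.

```python
def ac_at_k(cases, k):
    """cases: list of (ranking, root_causes) pairs, where ranking is the
    ordered list of monitoring variables and root_causes is a non-empty set.
    Returns the mean per-case recall within the top k (an assumed definition)."""
    total = 0.0
    for ranking, root_causes in cases:
        hits = len(set(ranking[:k]) & set(root_causes))
        total += hits / len(root_causes)
    return total / len(cases)


# Usage with two hypothetical fault cases.
cases = [
    (["a", "b", "c"], {"a"}),   # root cause ranked first
    (["x", "y", "z"], {"z"}),   # root cause ranked third
]
top1 = ac_at_k(cases, 1)
top3 = ac_at_k(cases, 3)
```

Under this reading, AC@k cannot decrease as k grows, which matches the trend of every row in Table 2.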
In the embodiment of the invention, constructing the causal relationships among the monitoring indexes enables precise fault localization, so that the one or more most critical monitoring indexes can be screened from a large number of monitoring indexes, helping technicians restore the system to normal.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made in the above embodiments by those of ordinary skill in the art without departing from the principle and spirit of the present invention.

Claims (10)

1. A database fault diagnosis method based on causal relationship is characterized by comprising the following steps:
collecting monitoring data of monitoring indexes of a database in a preset time period, and constructing a causal relationship graph among the monitoring indexes; wherein the monitoring data comprises fault data and non-fault data;
constructing a regression model for the monitoring index by using the fault-free data based on the causal relationship graph;
calculating a regression error of the fault data through the regression model;
based on the regression error, calculating each monitoring index through a preset calculation formula so as to sort the monitoring indexes to obtain a monitoring index arrangement sequence;
and determining the fault position of the database according to the arrangement sequence of the monitoring indexes and based on the monitoring indexes.
2. The method of claim 1, wherein monitoring the metrics comprises: access conditions, connection number occupation conditions, memory occupation conditions, disk capacity occupation conditions, index use conditions, network traffic, node states, data query conditions and/or node log information.
3. The method of claim 1, wherein said calculating a regression error of said fault data via said regression model comprises:
for the pre-fault data of the fault data, computing the mean $m_i$ and variance $s_i$ of the regression error of the monitoring index on the pre-fault data; and
for the data during the fault of the fault data, recording the regression error of the monitoring index on the data during the fault as $e_{ij}$.
4. The method according to claim 1, wherein the preset calculation formula is:
$z_i = \max_j |e_{ij} - m_i| / s_i$
wherein $z_i$ is a statistic that measures whether each monitoring index Vi characterizes a fault.
5. The method according to claim 1, wherein the monitoring indexes are constructed into a monitoring index set, and the constructing of the causal graph among the monitoring indexes comprises:
dividing the monitoring indexes into corresponding meta-variables of corresponding components of a database architecture, and obtaining a first mapping from the meta-variables to the monitoring index set and a second mapping from the ordered meta-variable pair to the monitoring index set;
constructing a meta-variable causal relationship graph based on the database architecture and causal relationship information among the multiple meta-variables;
constructing a causal graph between the monitoring indicators based on the meta-variable causal graph, the first mapping, and the second mapping;
and instantiating the causal graph among the monitoring indexes into the causal graph among the monitoring indexes of the component instances.
6. The method according to claim 5, wherein the database architecture is represented as a call graph Gc = <Vc, Ec>, and the causal relationship information among the multiple meta-variables comprises:
the relation among the meta-variables within a component, AG = <Va, Ea>; and
mappings AP, AC, AD and AA from a meta-variable type to a set of meta-variable types; wherein AP is the set of cause variables from the caller component, AC is the set of result variables at the caller component, AD is the set of result variables at the caller components of every level, and AA is the set of cause variables from the caller components of every level.
7. The method according to claim 6, wherein constructing a meta-variable causal graph Gm = <Vm, Em> based on Gc, AG, AP, AC, AD, AA comprises:
for each component Ci in Vc and each meta-variable type Tx in Va, adding a meta-variable <Ci, Tx> to Vm;
for each component Ci in Vc and each edge Tx → Ty in Ea, adding an edge <Ci, Tx> → <Ci, Ty> to Em;
for each edge Ci → Cj in Ec, each meta-variable type Tx in Va, and each meta-variable type Ty in AP(Tx), adding an edge <Ci, Ty> → <Cj, Tx> to Em;
for each edge Ci → Cj in Ec, each meta-variable type Tx in Va, and each meta-variable type Ty in AC(Tx), adding an edge <Cj, Tx> → <Ci, Ty> to Em;
for each component Cj in Vc, each ancestor component Ci of Cj in Gc, each meta-variable type Tx in Va, and each meta-variable type Ty in AD(Tx), adding an edge <Cj, Tx> → <Ci, Ty> to Em;
for each component Cj in Vc, each ancestor component Ci of Cj in Gc, each meta-variable type Tx in Va, and each meta-variable type Ty in AA(Tx), adding an edge <Ci, Ty> → <Cj, Tx> to Em.
8. The method according to claim 7, characterized in that the causal graph among the monitoring indexes is G = <V, E>; the meta-variable causal graph Gm is traversed in topological order, starting from the meta-variable with the fewest cause variables; and for each meta-variable MVi traversed, the following steps are performed in sequence:
letting
Figure FDA0003577814030000021
letting
Figure FDA0003577814030000022
performing first case processing on each monitoring index Vy in M(MVi);
performing second case processing on each cause meta-variable MVj of MVi in Gm;
for each monitoring index Vx in P(MVi) and each Vy in R(MVi), adding Vx → Vy to E;
if
Figure FDA0003577814030000023
letting R(MVi) = P(MVi).
9. The method of claim 8, characterized in that
the first case processing comprises:
if RM(Vy) contains only the one meta-variable MVi, adding Vy to R(MVi) and to V;
if RM(Vy) contains multiple meta-variables, and some meta-variable in RM(Vy) other than MVi has not yet been visited in constructing the causal graph among the monitoring indexes, performing no operation;
if RM(Vy) contains multiple meta-variables, and all meta-variables in RM(Vy) other than MVi have been visited in constructing the causal graph among the monitoring indexes,
adding Vy to P(MVi) and to V; and
for each meta-variable MVj in RM(Vy) other than MVi and each monitoring index Vx in R(MVj), adding Vx → Vy to E;
the second case processing comprises:
if
Figure FDA0003577814030000031
adding Vx to P(MVi) for each monitoring index Vx in R(MVj);
if
Figure FDA0003577814030000032
for each monitoring index Vx in R(MVj) and each Vy in IV(MVj, MVi), adding Vy to P(MVi) and to V, and adding Vx → Vy to E.
10. The method according to claim 9, characterized in that the causal graph G among the monitoring indexes is instantiated as a causal graph Gi among the monitoring indexes of component instances, wherein each edge <Ci, x> → <Cj, y> in E, where x and y are monitoring index names, is handled in one of two cases:
if Ci = Cj, then for each instance u of component Ci, <Ci, u, x> → <Ci, u, y> is added to the graph Gi;
if Ci ≠ Cj, then for each instance u of component Ci and each instance v of component Cj, <Ci, u, x> → <Cj, v, y> is added to the graph Gi.
CN202210348192.XA 2022-04-01 2022-04-01 Database fault diagnosis method based on causal relationship Active CN114896090B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210348192.XA CN114896090B (en) 2022-04-01 2022-04-01 Database fault diagnosis method based on causal relationship

Publications (2)

Publication Number Publication Date
CN114896090A (en) 2022-08-12
CN114896090B (en) 2024-09-17

Family

ID=82715499

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210348192.XA Active CN114896090B (en) 2022-04-01 2022-04-01 Database fault diagnosis method based on causal relationship

Country Status (1)

Country Link
CN (1) CN114896090B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170228277A1 (en) * 2016-02-08 2017-08-10 Nec Laboratories America, Inc. Ranking Causal Anomalies via Temporal and Dynamical Analysis on Vanishing Correlations
CN113746663A (en) * 2021-06-07 2021-12-03 西安交通大学 Performance degradation fault root cause positioning method combining mechanism data and dual drives
CN113918416A (en) * 2021-09-30 2022-01-11 上海浦东发展银行股份有限公司 Software system abnormity positioning method based on event map

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
AMIN DHAOU 等: "Causal and Interpretable Rules for Time Series Analysis", 《KDD \'21: PROCEEDINGS OF THE 27TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING》, 1 August 2021 (2021-08-01), pages 2764 - 2772, XP058645955, DOI: 10.1145/3447548.3467161 *

Also Published As

Publication number Publication date
CN114896090B (en) 2024-09-17

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant