CN112506763A - Automatic positioning method and device for database system fault root cause

Info

Publication number: CN112506763A
Application number: CN202011372173.8A
Authority: CN
Inventors: 裴丹; 刘平
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2020-11-30
Filing date: 2020-11-30
Publication date: 2021-03-16

Abstract

The present application proposes a method and device for automatically locating the root cause of a database system failure, which relates to the technical field of data processing. The method includes: performing anomaly detection on all monitoring indicators of an abnormal database to obtain abnormal monitoring indicator data; constructing a monitoring indicator relationship diagram according to the abnormal monitoring indicator data; analyzing and sorting each monitoring indicator in the monitoring indicator relationship diagram according to a preset algorithm; and determining root cause monitoring indicators according to the sorting result. The root cause monitoring indicators of the faulty system are thereby located automatically and more efficiently, so that the faulty system can quickly return to normal.

Description

Automatic positioning method and device for database system fault root cause

Technical Field

The application relates to the technical field of data processing, in particular to a method and a device for automatically positioning a fault root cause of a database system.

Background

When a computer system breaks down, operation and maintenance personnel need to rapidly analyze a large number of abnormal monitoring indexes on the abnormal machine, locate the root cause indexes, and then take measures to stop the damage. In an actual production environment, because analyzing this large number of abnormal monitoring indexes depends on manual work by operation and maintenance personnel, locating the root cause indexes takes a long time, and the system remains in a fault state for a long time.

Disclosure of Invention

The present application is directed to solving, at least to some extent, one of the technical problems in the related art.

Therefore, a first objective of the present application is to provide an automatic location method for a root cause of a fault in a database system, which solves the problem of automatic location of a root cause monitoring index of a fault system, and improves the efficiency of locating the root cause monitoring index, so that the fault system can be quickly recovered to normal.

The second purpose of the present application is to provide an automatic database system fault root cause locating device.

To achieve the above object, an embodiment of a first aspect of the present application provides a method for automatically locating a failure root cause of a database system, including:

carrying out anomaly detection on all monitoring indexes of an anomaly database to obtain anomaly monitoring index data;

constructing a monitoring index relation graph according to the abnormal monitoring index data;

analyzing and sequencing each monitoring index in the monitoring index relation graph according to a preset algorithm;

and determining a root cause monitoring index according to the sequencing result.

According to the automatic positioning method for the database system fault root, all monitoring indexes of an abnormal database are subjected to abnormal detection, and abnormal monitoring index data are obtained; constructing a monitoring index relation graph according to the abnormal monitoring index data; analyzing and sequencing each monitoring index in the monitoring index relation graph according to a preset algorithm; and determining a root cause monitoring index according to the sequencing result. Therefore, the problem of automatic positioning of the root cause monitoring index of the fault system is solved, and the efficiency of positioning the root cause monitoring index is improved, so that the fault system can be quickly recovered to be normal.

In an embodiment of the present application, the performing anomaly detection on all monitoring indexes of an anomaly database to obtain anomaly monitoring index data includes:

acquiring a system abnormal time period of each monitoring index from a time sequence database, and acquiring all monitoring index data corresponding to the system abnormal time period and a time period with a preset difference from the system abnormal time period;

and analyzing all monitoring index data by using a clustering-based robust anomaly detection algorithm to obtain the anomaly monitoring index data.

In an embodiment of the present application, the constructing a monitoring index relation graph according to the abnormal monitoring index data includes:

acquiring the relation between each abnormal monitoring index in the abnormal monitoring index data;

and constructing the monitoring index relation graph by taking each abnormal monitoring index as a point and taking the relation between each abnormal monitoring index as a side.

In an embodiment of the application, the analyzing and sorting the monitoring indexes in the monitoring index relation graph according to a preset algorithm includes:

analyzing the monitoring index relation graph by a positioning algorithm based on the weighted webpage access evaluation method (Weighted PageRank) to obtain the weight of a relation edge between each abnormal monitoring index in the monitoring index relation graph;

and sequencing the monitoring indexes according to the weight of the relation edge between the abnormal monitoring indexes.

In an embodiment of the present application, the method for automatically locating a database system fault root further includes:

and sending the root cause monitoring index to target equipment for display.

In order to achieve the above object, a second aspect of the present application provides an automatic database system fault root cause locating device, including:

the acquisition module is used for carrying out abnormity detection on all monitoring indexes of the abnormity database and acquiring abnormity monitoring index data;

the construction module is used for constructing a monitoring index relation graph according to the abnormal monitoring index data;

the analysis module is used for analyzing and sequencing all monitoring indexes in the monitoring index relation graph according to a preset algorithm;

and the determining module is used for determining the root cause monitoring index according to the sequencing result.

According to the automatic positioning device for the database system fault root cause, all monitoring indexes of an abnormal database are subjected to abnormal detection, and abnormal monitoring index data are obtained; constructing a monitoring index relation graph according to the abnormal monitoring index data; analyzing and sequencing each monitoring index in the monitoring index relation graph according to a preset algorithm; and determining a root cause monitoring index according to the sequencing result. Therefore, the problem of automatic positioning of the root cause monitoring index of the fault system is solved, and the efficiency of positioning the root cause monitoring index is improved, so that the fault system can be quickly recovered to be normal.

In an embodiment of the present application, the obtaining module is configured to:

In one embodiment of the present application, the building module is configured to:

In an embodiment of the application, the analysis module is specifically configured to:

In an embodiment of the present application, the apparatus further includes:

and the sending module is used for sending the root cause monitoring index to target equipment for displaying.

Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.

Drawings

The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

FIG. 1 is a schematic flowchart illustrating a method for automatically locating a failure root cause of a database system according to an embodiment of the present application;

FIG. 2 is an example of a smoothed noise segment according to an embodiment of the present application;

FIG. 3 is an example of a smoothing segment according to an embodiment of the present application;

FIG. 4 is a smoothing process of a real abnormal monitoring index according to an embodiment of the present application;

FIG. 5 is a weighted undirected monitoring index relationship diagram according to an embodiment of the present application;

FIG. 6 is a diagram illustrating an exemplary structure of an automatic database system fault root cause location system according to an embodiment of the present application;

FIG. 7 is a schematic structural diagram of an automatic database system fault root cause locating device according to an embodiment of the present disclosure.

Detailed Description

Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary and intended to be used for explaining the present application and should not be construed as limiting the present application.

The database system fault root automatic positioning method and device according to the embodiment of the application are described below with reference to the accompanying drawings.

Fig. 1 is a schematic flowchart of a method for automatically locating a failure root cause of a database system according to an embodiment of the present application.

As shown in fig. 1, the automatic database system fault root locating method includes the following steps:

step 101, performing anomaly detection on all monitoring indexes of an anomaly database to obtain anomaly monitoring index data.

In the embodiment of the application, the system abnormal time period of each monitoring index is obtained from a time sequence database, and all monitoring index data corresponding to the system abnormal time period and the time period with the preset difference value from the system abnormal time period are obtained; and analyzing all monitoring index data by using a clustering-based robust anomaly detection algorithm to obtain anomaly monitoring index data.

In the embodiment of the application, anomaly detection is performed on all monitoring indexes of the abnormal database. Historical data of the monitoring indexes (such as CPU utilization, disk utilization and the like) are stored in a time sequence database, and when anomaly detection is performed, the time sequence database is queried to obtain the data of each monitoring index in the vicinity of the system abnormal time period for analysis. In an actual production environment, however, even if the system is in a normal state, its monitoring index data can contain noise and irregular fluctuation.

Therefore, when the system is abnormal, noise, normal fluctuation, and abnormal fluctuation in the monitoring data are mixed together, which affects the performance of the abnormality detection. Therefore, the method designs a robust anomaly detection algorithm based on clustering, and the algorithm can effectively detect anomalies of monitoring data mixed with noise, normal fluctuation and abnormal fluctuation.

In the embodiment of the application, the cluster-based robust anomaly detection algorithm mainly divides the monitoring index into different segments through clustering, distinguishes a noise data segment and an anomaly fluctuation segment, and then carries out anomaly monitoring based on the cluster-based smoothing algorithm.

Specifically, the smoothing algorithm is a loop algorithm, and a part of the noise segments is smoothed in each loop. The algorithm loops several times until all noise segments are smoothed. The input of the algorithm is the raw data $x_i$ of the monitoring index, and the output is the monitoring index data after smoothing by the algorithm.

In each cycle, the monitoring index data is first clustered into two classes, such as a normal class and an abnormal class, by a Gaussian mixture model; the number of clusters of the Gaussian mixture model is therefore set to two. The monitoring index data is then divided into different segments according to the clustering result.
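The clustering and segmentation step above can be illustrated with a minimal sketch. It assumes a one-dimensional series and uses a hand-written two-component EM fit as a stand-in for whichever Gaussian mixture implementation is actually used; the function names `fit_two_gaussians` and `split_into_segments`, the initialisation, and the iteration count are all hypothetical choices made for illustration.

```python
import math
import statistics
from itertools import groupby


def fit_two_gaussians(xs, iters=50):
    """Fit a 2-component 1-D Gaussian mixture with plain EM; return a 0/1 label per point."""
    # Assumption: start the two means at the extremes of the data.
    mu = [min(xs), max(xs)]
    sd = [statistics.pstdev(xs) or 1.0] * 2
    w = [0.5, 0.5]
    resp = []
    for _ in range(iters):
        # E-step: responsibility of component 1 for every point.
        resp = []
        for x in xs:
            p = [w[c] * math.exp(-0.5 * ((x - mu[c]) / sd[c]) ** 2) / sd[c] for c in (0, 1)]
            total = p[0] + p[1]
            resp.append(p[1] / total if total > 0 else 0.5)
        # M-step: re-estimate weight, mean and standard deviation of each component.
        for c in (0, 1):
            r = [q if c == 1 else 1.0 - q for q in resp]
            n = sum(r)
            if n < 1e-9:
                continue
            w[c] = n / len(xs)
            mu[c] = sum(ri * x for ri, x in zip(r, xs)) / n
            var = sum(ri * (x - mu[c]) ** 2 for ri, x in zip(r, xs)) / n
            sd[c] = max(math.sqrt(var), 1e-3)  # floor keeps the density finite
    return [1 if q > 0.5 else 0 for q in resp]


def split_into_segments(xs, labels):
    """Cut the series wherever the cluster label changes; return (label, values) pairs."""
    segments, start = [], 0
    for label, run in groupby(labels):
        n = len(list(run))
        segments.append((label, xs[start:start + n]))
        start += n
    return segments


series = [1.0, 1.1, 0.9, 1.0, 5.0, 5.2, 1.0, 0.9, 1.1, 1.0, 1.05, 0.95, 5.1, 4.9, 5.0, 5.2]
segments = split_into_segments(series, fit_two_gaussians(series))
```

On this toy series the short burst in the middle and the final jump fall into the same cluster, so they can only be told apart by segment length and position, which is what the subsequent smoothing step relies on.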

In the embodiment of the present application, $s_j$ denotes the j-th segment and $|s_j|$ denotes the length of $s_j$. A noise segment is shorter than its neighbors, so when the length of $s_j$ is less than the lengths of $s_{j-1}$ and $s_{j+1}$, that is, $\min\{|s_{j-1}|, |s_{j+1}|\} > |s_j|$, $s_j$ is a noise segment and is smoothed using data from its neighbors in the manner shown in FIG. 2: if $s_j$ is a noise segment, $s_j$ is replaced with data randomly sampled from $s_{j+1}$.

The reason for sampling from $s_{j+1}$ is as follows. If the monitoring index is divided into k segments and the left end of the analysis window happens to fall in a noise region, $s_0$ is a noise segment. If $s_1$ is then also a noise segment, $s_1$ cannot be smoothed using data from the noise segment $s_0$. Further, since the right end of the analysis window lies in the system abnormal time period, the last segment $s_{k-1}$ may be an abnormal segment. If $s_{k-2}$ is a noise segment, smoothing $s_{k-2}$ with data from the abnormal segment $s_{k-1}$ does not affect the result of the anomaly detection. Thus, the noise segment $s_j$ is smoothed using data randomly sampled from segment $s_{j+1}$.

Finally, $s_0$ and $s_{k-1}$ require separate treatment. First, if $s_0$ is of the same class as the noise segments that have been smoothed, then $s_0$ is also a noise segment, and $s_0$ is smoothed using data randomly sampled from $s_1$. If no noise segment has been smoothed, then $s_0$ is smoothed using data randomly sampled from $s_1$ when the length of $s_1$ is greater than that of $s_0$.

FIG. 3 shows an example of smoothing segment $s_0$. Because $s_{k-1}$ lies in the system abnormal time period, $s_{k-1}$ is not smoothed. FIG. 4 shows the smoothing process for a real abnormal monitoring index: after two smoothing passes, all the noise data is smoothed out and the abnormal data is preserved.
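The smoothing loop described above can be sketched as follows. This is an illustrative simplification under stated assumptions, not the exact procedure of the application: segments are given as `(label, values)` pairs, only interior segments are tested with the length rule, the last segment is never touched, and the separate handling of the first segment is omitted. The names `merge_adjacent`, `smooth_once` and `smooth` and the fixed random seed are hypothetical.

```python
import random


def merge_adjacent(segments):
    """Merge neighbouring segments that carry the same cluster label."""
    merged = []
    for label, values in segments:
        if merged and merged[-1][0] == label:
            merged[-1] = (label, merged[-1][1] + list(values))
        else:
            merged.append((label, list(values)))
    return merged


def smooth_once(segments, rng):
    """One pass: replace each interior noise segment by samples from its right neighbour."""
    out = [(label, list(values)) for label, values in segments]
    changed = False
    for j in range(1, len(segments) - 1):  # skip the first and the last segment
        cur = segments[j][1]
        left, right = segments[j - 1][1], segments[j + 1][1]
        if min(len(left), len(right)) > len(cur):  # shorter than both neighbours
            out[j] = (segments[j + 1][0], [rng.choice(right) for _ in cur])
            changed = True
    return merge_adjacent(out), changed


def smooth(segments, seed=0):
    """Repeat smoothing passes until no interior noise segment is left."""
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    segments = merge_adjacent(segments)
    changed = True
    while changed:
        segments, changed = smooth_once(segments, rng)
    return segments


example = [
    (0, [1.0, 1.1, 0.9, 1.0]),
    (1, [5.0, 5.2]),  # short burst: noise
    (0, [1.0, 0.9, 1.1, 1.0, 1.05, 0.95]),
    (1, [5.1, 4.9, 5.0, 5.2]),  # last segment: kept as is
]
smoothed = smooth(example)
```

Each pass that changes anything merges at least two segments, so the loop terminates. In the example the two-point burst is overwritten with values drawn from the segment to its right, leaving one long normal segment followed by the untouched final segment.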

After the smoothing processing, the monitoring index is finally divided into a plurality of segments, with a fluctuation between every two adjacent segments. The fluctuation closest to the time of the system fault is the one most likely to be associated with the fault. Thus, the present application is concerned only with the fluctuation between $s_{k-2}$ and $s_{k-1}$, and the degree of abnormality of this fluctuation can be measured by the z-score.

First, the z-score of each data point $x$ in $s_{k-1}$ is computed from the mean and standard deviation (std) of the data in $s_{k-2}$:

$$z(x) = \frac{x - \operatorname{mean}(s_{k-2})}{\operatorname{std}(s_{k-2})}$$

The mean of the z-scores of all data points in $s_{k-1}$ is taken as the z-score of $s_{k-1}$. Then, the 3-sigma rule is used to judge whether the fluctuation between $s_{k-2}$ and $s_{k-1}$ is an abnormal fluctuation. If the mean z-score of all data points in $s_{k-1}$ exceeds 3, that is, the deviation is more than three standard deviations, the fluctuation is an abnormal fluctuation, and the corresponding monitoring index is an abnormal monitoring index.
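The z-score check on the last two segments can be sketched as below. The function name `fluctuation_is_abnormal`, the use of the population standard deviation, the guard for a flat reference segment, and the use of the absolute value (so that sudden drops are flagged as well as sudden rises) are assumptions made for this illustration.

```python
import statistics


def fluctuation_is_abnormal(prev_segment, last_segment, threshold=3.0):
    """Score the last segment against the mean and std of the segment before it."""
    mean = statistics.fmean(prev_segment)
    std = statistics.pstdev(prev_segment)
    if std == 0:
        std = 1e-9  # assumption: avoid division by zero on a perfectly flat segment
    z_scores = [(x - mean) / std for x in last_segment]
    segment_z = statistics.fmean(z_scores)  # z-score of the whole last segment
    return segment_z, abs(segment_z) > threshold  # 3-sigma rule


reference = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95]
z_jump, jump_abnormal = fluctuation_is_abnormal(reference, [5.0, 5.2, 4.9])
z_flat, flat_abnormal = fluctuation_is_abnormal(reference, [1.02, 0.98, 1.0])
```

Here the jump to values around 5 is flagged as abnormal, while the segment that stays near 1 is not.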

Step 102, constructing a monitoring index relation graph according to the abnormal monitoring index data.

In the embodiment of the application, the relation among all the abnormal monitoring indexes in the abnormal monitoring index data is obtained, and a monitoring index relation graph is constructed by taking all the abnormal monitoring indexes as points and taking the relation among all the abnormal monitoring indexes as sides.

In the embodiment of the application, the dependency graph construction algorithm can automatically construct a weighted undirected dependency graph to accurately represent the dependency relationship between abnormal monitoring indexes, and the constructed monitoring index relationship graph between the monitoring indexes is the core of root cause positioning.

Because the existing automatic dependency graph construction algorithm can deduce wrong dependency relationship, the application provides a Weighted Undirected Dependency Graph (WUDG), namely a monitoring index relation graph, which can more accurately represent the dependency relationship between monitoring indexes, and the core thought of constructing the weighted undirected dependency graph is as follows: if a dependency exists between two monitoring indexes, the two monitoring indexes are not independent, so that the design of the weighted undirected dependency graph is based on whether the dependency exists between the monitoring indexes (undirected graph) or not, and the direction of the dependency does not need to be inferred (directed graph). Determining whether a dependency exists may be more accurate than inferring the direction of the dependency.

First, a fully connected graph over all abnormal monitoring indexes is constructed, in which nodes represent the monitoring indexes and edges represent the dependency relationships among them. The strength of each dependency relationship is then calculated by performing an independence test on the two monitoring indexes on each edge. For example, the independence between two monitoring indexes X and Y can be evaluated with the Fisher-Z algorithm. The Fisher-Z algorithm is an independence test based on the Pearson correlation coefficient that combines the Fisher-Z transformation with the partial correlation coefficient: the Fisher-Z transformation is used to evaluate the overall correlation, and the partial correlation coefficient is used to evaluate the effect of other nodes. The Fisher-Z test between X and Y is expressed in terms of m, the number of monitoring index data points, and r, the partial correlation coefficient between X and Y.
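The expression of the Fisher-Z test itself is not reproduced in this text. For reference only, and as an assumption rather than a transcription of the original formula, the textbook form of the Fisher-Z test statistic for a partial correlation coefficient r estimated from m data points, conditioned on a set S of other monitoring indexes, is:

$$Z(X, Y \mid S) = \frac{\sqrt{m - |S| - 3}}{2}\,\ln\frac{1 + r}{1 - r}$$

Under the null hypothesis that X and Y are independent given S, this statistic is approximately standard normal, so the two-sided p value is $2\,(1 - \Phi(|Z|))$, where $\Phi$ is the standard normal cumulative distribution function. The conditioning set S and its size $|S|$ are symbols introduced here for illustration; with no conditioning variables the leading factor reduces to $\sqrt{m - 3}/2$.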

Specifically, the strength of the dependency relationship between two abnormal monitoring indexes can be measured by the p value of the null hypothesis in the Fisher-Z test: the larger the p value, the weaker the dependency relationship between the two abnormal monitoring indexes, and conversely, the smaller the p value, the stronger the dependency relationship. Therefore, 1/p can be taken as the weight of an edge to finally generate the weighted undirected dependency graph. FIG. 5 shows an example of such a monitoring index relation graph, where $p_m^n$ represents the p value of the Fisher-Z test between monitoring indexes n and m.
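The construction of the weighted undirected dependency graph with 1/p edge weights can be sketched as follows. As a simplifying assumption, the sketch tests each pair with the plain Pearson correlation (no conditioning on other indexes) instead of a partial correlation, and it assumes that no series is constant. The names `pearson`, `fisher_z_p_value` and `build_wudg`, the clamping constants, and the metric names in the example are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length, non-constant series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def fisher_z_p_value(x, y):
    """Two-sided p value of the null hypothesis that x and y are independent."""
    r = max(min(pearson(x, y), 0.999999), -0.999999)  # keep atanh finite
    z = math.sqrt(len(x) - 3) * math.atanh(r)  # Fisher-Z statistic, no conditioning set
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))


def build_wudg(series):
    """Fully connect all abnormal indexes and weight every edge by 1/p."""
    graph = {}
    for a, b in combinations(sorted(series), 2):
        p = max(fisher_z_p_value(series[a], series[b]), 1e-300)  # avoid division by zero
        graph[(a, b)] = 1.0 / p
    return graph


metrics = {
    "cpu": [1, 2, 3, 4, 5, 6, 7, 8],
    "qps": [2, 4, 6, 8, 10, 12, 14, 17],
    "disk": [5, 3, 6, 2, 7, 1, 6, 4],
}
wudg = build_wudg(metrics)
```

Because a p value never exceeds 1, every edge weight is at least 1; strongly dependent pairs such as `cpu` and `qps` in the example receive far larger weights than weakly related pairs.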

Step 103, analyzing and sequencing each monitoring index in the monitoring index relation graph according to a preset algorithm.

Step 104, determining a root cause monitoring index according to the sequencing result.

In the embodiment of the application, the monitoring index relation graph is analyzed by a positioning algorithm based on the weighted webpage access evaluation method (Weighted PageRank), the weight of the relation edge between abnormal monitoring indexes in the monitoring index relation graph is obtained, and the monitoring indexes are sorted according to the weight of the relation edge between the abnormal monitoring indexes.

In the embodiment of the application, the root cause related indexes are positioned based on the constructed monitoring index relation graph, and after the abnormal monitoring indexes are analyzed, the weighted undirected monitoring index relation graph among the abnormal monitoring indexes is generated.

It can be understood that a weighted undirected dependency graph between abnormal monitoring indexes, i.e. a monitoring index relation graph, is generated, and the graph contains root cause related indexes and symptom indexes. Therefore, in the application, it is assumed that the root cause related indexes are indexes with the largest influence in the Weighted undirected monitoring index relation graph, and a Weighted web access evaluation method (Weighted PageRank) can measure the influence of the nodes in the Weighted undirected graph. Therefore, the method designs a positioning algorithm based on a Weighted webpage access evaluation method (Weighted PageRank) to analyze the monitoring index relation graph and finally outputs a possible ranking result of the root cause related indexes, namely a diagnosis result of the root cause monitoring index automatic positioning technology.

Specifically, Weighted PageRank can measure the influence of the nodes in a weighted undirected graph. For a monitoring index u, the score of u is calculated by Weighted PageRank from the following quantities: B(u), the set of nodes directly connected to u (the abnormal monitoring indexes that have a dependency relationship with u); the weight of the edge between node u and node v; and a constant d, which is set to 0.85. All the abnormal monitoring indexes are sorted by the calculated score, and the abnormal monitoring indexes ranked at the front are the possible root cause related indexes.
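The ranking step can be sketched with one common formulation of Weighted PageRank on an undirected graph. The exact score expression of the application is not reproduced in this text, so the update rule below is an assumption: each node v passes its score to its neighbours in proportion to the edge weights, score(u) = (1 - d) + d * sum over v in B(u) of score(v) * w(v, u) / (total weight of edges at v). The function name `weighted_pagerank`, the iteration count, and the example weights are hypothetical.

```python
def weighted_pagerank(edges, d=0.85, iters=100):
    """Rank nodes of a weighted undirected graph given as {(u, v): weight}."""
    neighbours = {}
    for (u, v), w in edges.items():
        neighbours.setdefault(u, {})[v] = w
        neighbours.setdefault(v, {})[u] = w
    # Total edge weight at each node, used to split a node's score among neighbours.
    strength = {u: sum(ws.values()) for u, ws in neighbours.items()}
    score = {u: 1.0 for u in neighbours}
    for _ in range(iters):
        score = {
            u: (1 - d) + d * sum(score[v] * w / strength[v] for v, w in neighbours[u].items())
            for u in neighbours
        }
    # Highest score first: candidates for root cause related indexes.
    return sorted(score.items(), key=lambda item: item[1], reverse=True)


ranking = weighted_pagerank({("cpu", "qps"): 10.0, ("cpu", "disk"): 10.0, ("qps", "disk"): 1.0})
```

In the example, `cpu` is tied to both other indexes by heavy edges and therefore ends up first in the ranking. In practice the edge weights would be the 1/p values of the monitoring index relation graph.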

In the embodiment of the application, the root cause monitoring index is sent to the target device to be displayed, so that the root cause monitoring index can be rapidly known, system faults are processed, and a fault system can be rapidly recovered to be normal.

Specifically, as shown in FIG. 6, Web services require an underlying database to support their critical business and real-time applications. The automatic root cause monitoring index positioning technology is triggered when the performance of the database system becomes abnormal, for example when the response time of the database suddenly increases. It then performs anomaly detection, constructs the monitoring index relation graph, and analyzes the graph to locate the root cause monitoring indexes. After the analysis, operation and maintenance personnel can rapidly take loss-stopping measures based on the diagnosis result so that the system returns to normal as soon as possible. Common loss-stopping measures include SQL (database language) flow control, SQL optimization, system capacity expansion, and the like.

With the rapid development of cloud services, performance monitoring and root cause analysis of the underlying database clusters that support them face greater and greater challenges. For a large-scale underlying database cluster supporting cloud services, hundreds of database anomalies per day make manual anomaly diagnosis infeasible. The system can automatically diagnose performance anomalies of the online system: when the system is abnormal, the root cause related indexes can be quickly located, so that operation and maintenance personnel can rapidly analyze them and take loss-stopping measures, and the system can return to normal in time. Operation and maintenance personnel can also focus on the root cause related indexes located by the algorithm, which greatly reduces the impact of alarm storms.

In order to implement the above embodiments, the present application further provides an automatic positioning device for a database system fault root cause.

Fig. 7 is a schematic structural diagram of an automatic positioning device for a database system fault root cause according to an embodiment of the present application.

As shown in fig. 7, the automatic database system fault root cause locating device includes: an acquisition module 210, a construction module 220, an analysis module 230, and a determination module 240.

The obtaining module 210 is configured to perform anomaly detection on all monitoring indexes of the anomaly database, and obtain anomaly monitoring index data.

The constructing module 220 is configured to construct a monitoring index relation graph according to the abnormal monitoring index data.

The analysis module 230 is configured to analyze and sort the monitoring indexes in the monitoring index relation graph according to a preset algorithm.

The determining module 240 is configured to determine the root cause monitoring index according to the sorting result.

In an embodiment of the present application, the obtaining module 210 is configured to: acquiring a system abnormal time period of each monitoring index from a time sequence database, and acquiring all monitoring index data corresponding to the system abnormal time period and a time period with a preset difference from the system abnormal time period; and analyzing all monitoring index data by using a clustering-based robust anomaly detection algorithm to obtain the anomaly monitoring index data.

In one embodiment of the present application, the constructing module 220 is configured to: acquire the relation between each abnormal monitoring index in the abnormal monitoring index data; and construct the monitoring index relation graph by taking each abnormal monitoring index as a point and taking the relation between each abnormal monitoring index as a side.

In an embodiment of the present application, the analysis module 230 is specifically configured to: analyze the monitoring index relation graph by a positioning algorithm based on the weighted webpage access evaluation method (Weighted PageRank) to obtain the weight of a relation edge between each abnormal monitoring index in the monitoring index relation graph; and sort the monitoring indexes according to the weight of the relation edge between the abnormal monitoring indexes.

In an embodiment of the present application, the apparatus further includes: and the sending module is used for sending the root cause monitoring index to target equipment for displaying.

It should be noted that the foregoing explanation of the embodiment of the method for automatically locating a database system fault root cause is also applicable to the apparatus for automatically locating a database system fault root cause of the embodiment, and is not repeated here.

In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.

Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.

Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.

The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.

It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.

It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.

In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.

The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.

Claims

1. A database system fault root cause automatic positioning method is characterized by comprising the following steps:

2. The method of claim 1, wherein the performing anomaly detection on all monitoring indexes of an anomaly database to obtain anomaly monitoring index data comprises:

3. The method of claim 1, wherein the constructing a monitoring index relationship graph according to the abnormal monitoring index data comprises:

4. The method of claim 1, wherein the analyzing and sequencing the monitoring indicators in the monitoring indicator relationship graph according to a preset algorithm comprises

5. The method of claim 1, further comprising:

and sending the root cause monitoring index to target equipment for display.

6. An automatic positioning device for a fault root cause of a database system is characterized by comprising:

7. The apparatus of claim 6, wherein the acquisition module is to:

8. The apparatus of claim 6, wherein the build module is to:

9. The apparatus of claim 6, wherein the analysis module is specifically configured to:

10. The apparatus of claim 6, further comprising: