CN112749071B

CN112749071B - System and method for detecting health degree of application cluster

Info

Publication number: CN112749071B
Application number: CN202011609759.1A
Authority: CN
Inventors: 陈胜仇; 吴海洋; 吴倩; 花卉; 王玮; 马德晶; 周士成
Original assignee: Shanghai Data Center of China Life Insurance Co Ltd
Current assignee: Shanghai Data Center of China Life Insurance Co Ltd
Priority date: 2020-12-30
Filing date: 2020-12-30
Publication date: 2023-11-14
Anticipated expiration: 2040-12-30
Also published as: CN112749071A

Abstract

The invention relates to a system and a method for detecting the health of an application cluster, wherein the method comprises the steps of initializing application cluster information to be monitored; the data acquisition unit acquires performance data generated by the APM tool and stores the performance data in the local database; the baseline calculation unit acquires a dynamic baseline according to the historical time sequence performance data in the local database; the abnormality detection unit detects abnormality of the real-time performance data in the local database; and the alarm unit judges the health degree of the application cluster according to the abnormality detection result of the abnormality detection unit and sends out alarm information. Compared with the prior art, the method and the device are based on the data acquisition unit, the dynamic baseline calculation unit, the anomaly detection unit, the alarm unit, the task scheduling unit and the local database, APM data of the application cluster are effectively stored, the health degree of the application cluster is obtained, the accuracy and the high efficiency of the health degree detection of the application cluster are improved, and the reliability of the health degree detection of the application cluster is improved.

Description

System and method for detecting health degree of application cluster

Technical Field

The invention relates to the field of application cluster health detection, in particular to a system and a method for detecting application cluster health.

Background

With the growing popularity of the microservice architecture, more and more IT projects are developed as microservices. Microservices decouple the modules of a system and, through independent deployment and rapid iteration, meet the needs of agile enterprise development. However, as the number of service modules grows, the calls between services become increasingly complex, and an application service call-chain analysis (APM) tool is needed to monitor the call status and response performance of each application service.

Common APM tools currently include Naver's Pinpoint, Google's Dapper, Twitter's Zipkin, Taobao's EagleEye, and Dianping's CAT.

Although these tools present the performance of each service and the call relationships among services well, and can raise alarms on indicators such as the number of slow transactions and the number of errors according to user-defined thresholds, the following disadvantages remain:

1. the setting of part of index alarm threshold depends on user experience and cannot be scientifically set;

2. as service load changes at different times of day, some performance indicators vary dynamically, and the existing tools cannot establish a dynamic baseline for such indicators;

3. the tolerance of the conventional threshold alarm to faults is low, and the false alarm rate is high.

Disclosure of Invention

The present invention is directed to a system and method for detecting the health of an application cluster, which overcomes the above-mentioned drawbacks of the prior art.

The aim of the invention can be achieved by the following technical scheme:

a system for detecting the health degree of an application cluster comprises a data acquisition unit, a dynamic baseline calculation unit, an anomaly detection unit, an alarm unit, a task scheduling unit and a local database,

the data acquisition unit is used for acquiring the performance data generated by the APM tool and storing the performance data into the local database,

the dynamic baseline calculation unit is used for generating a dynamic baseline according to the historical time sequence performance data in the local database,

the abnormality detection unit is used for detecting abnormality of the real-time performance data in the local database and labeling the real-time data with normal labels or abnormal labels,

the alarm unit judges the health degree of the application cluster according to the abnormal detection result of the abnormal detection unit and sends out alarm information,

the local database is used for storing performance data.

Preferably, the system further comprises a front-end display unit, wherein the front-end display unit is used for displaying the dynamic baseline and the health degree of each application cluster index in real time.

Preferably, the system further comprises a task scheduling unit for uniformly managing the tasks of data acquisition, dynamic baseline calculation, alarm polling check, data archiving and the like, and as a daemon, ensuring the normal operation of all the calculation modules.

A method for detecting the health of an application cluster, based on the system for detecting the health of the application cluster, comprising the following steps:

S1: initializing the application cluster information to be monitored;

S2: the data acquisition unit acquires performance data generated by the APM tool and stores the performance data in the local database;

S3: the baseline calculation unit obtains a dynamic baseline from the historical time-series performance data in the local database;

S4: the abnormality detection unit performs abnormality detection on the real-time performance data in the local database;

S5: the alarm unit judges the health of the application cluster according to the abnormality detection result of the abnormality detection unit and sends out alarm information.

Preferably, the step S1 specifically includes:

S101: synchronizing the application cluster information of the accessed APM tool;

S102: judging whether a dynamic baseline is to be established; if so, entering step S2; otherwise, returning to step S101.

Preferably, the step S2 specifically includes:

S201: acquiring an APM data source;

S201: performing slice statistics on the APM data source to obtain data slices;

S202: organizing key performance indicator data according to the data slices;

S203: storing the organized key performance indicator data in a local relational database.

Preferably, the step S3 specifically includes:

S301: reading the historical time-series performance data from the local database using the Python pandas module;

S302: converting the historical time-series performance data into a DataFrame;

S303: judging whether the historical time-series performance data contains abnormal data; if so, removing the abnormal data and entering S304; otherwise, entering S304 directly;

S304: calculating the mean and the variance of the historical time-series performance data at the same time point of each day, and generating a dynamic baseline from them;

S305: writing the dynamic baseline to the database.

Preferably, in step S304, the dynamic baseline maximum is generated by adding twice the variance to the mean of the historical time-series performance data, and the dynamic baseline minimum is generated by subtracting twice the variance from the mean of the historical time-series performance data.
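The baseline rule of step S304 can be sketched as follows. This is a minimal illustration, not the patented implementation: it uses only the Python standard library rather than pandas to stay self-contained, assumes abnormal points were already removed in S303, and assumes the population variance, since the text does not say which variance estimator is meant. The function name `dynamic_baseline`, the `"HH:MM"` slot key and the multiplier parameter `k` are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, pvariance


def dynamic_baseline(samples, k=2.0):
    """Build a per-time-of-day baseline from (timestamp, value) pairs.

    Returns {"HH:MM": (lower, mean, upper)} where
    upper = mean + k * variance and lower = mean - k * variance,
    following the wording of step S304 (k = 2).
    """
    by_slot = defaultdict(list)
    for ts, value in samples:
        # Group values observed at the same time point on different days.
        by_slot[ts.strftime("%H:%M")].append(value)

    baseline = {}
    for slot, values in sorted(by_slot.items()):
        mu = mean(values)
        var = pvariance(values, mu)  # assumption: population variance
        baseline[slot] = (mu - k * var, mu, mu + k * var)
    return baseline


# Hypothetical history: three days, two time points per day.
history = [
    (datetime(2020, 12, 1, 9, 0), 10.0),
    (datetime(2020, 12, 2, 9, 0), 12.0),
    (datetime(2020, 12, 3, 9, 0), 14.0),
    (datetime(2020, 12, 1, 9, 5), 20.0),
    (datetime(2020, 12, 2, 9, 5), 20.0),
    (datetime(2020, 12, 3, 9, 5), 20.0),
]
baseline = dynamic_baseline(history)
```

Keeping the multiplier as a parameter makes it easy to compare the variance-based band described here with a band based on the standard deviation, which is what the name "2 sigma" more commonly refers to.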

Preferably, the step S4 specifically includes:

S401: acquiring the real-time performance data in the local database;

S402: judging whether the real-time performance data matches an abnormal rule; if so, attaching an abnormal label to the real-time performance data, writing the label into the local database, and entering step S6; otherwise, entering step S403;

S403: judging whether the real-time performance data is abnormal according to an abnormality detection algorithm; if so, attaching an abnormal label to the real-time performance data and writing the label into the local database; if not, attaching a normal label to the real-time performance data and writing the label into the local database.
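The two-stage decision in S402 and S403 can be sketched as below. This is an illustrative outline only: the rule, the field names and the stand-in detector are hypothetical, and the label values 1 and -1 are taken from the embodiment. The detector is passed in as a callable so that any algorithm can be plugged in; scikit-learn's `IsolationForest.predict`, for instance, already returns 1 for inliers and -1 for outliers.

```python
NORMAL, ABNORMAL = 1, -1


def label_point(point, rules, detector):
    """Label one real-time data point.

    S402: if any abnormal rule matches, label the point abnormal at once.
    S403: otherwise defer to the anomaly detection algorithm.
    `detector(point)` must return True when the point is anomalous.
    """
    if any(rule(point) for rule in rules):
        return ABNORMAL
    return ABNORMAL if detector(point) else NORMAL


# Hypothetical abnormal rule: more than half of the requests failed.
rules = [lambda p: p["error_rate"] > 0.5]


def outside_baseline(p):
    """Stand-in detector (not isolation forest): value leaves the baseline band."""
    return not (p["baseline_min"] <= p["slow_count"] <= p["baseline_max"])


points = [
    {"error_rate": 0.9, "slow_count": 5, "baseline_min": 0, "baseline_max": 10},
    {"error_rate": 0.0, "slow_count": 25, "baseline_min": 0, "baseline_max": 10},
    {"error_rate": 0.1, "slow_count": 5, "baseline_min": 0, "baseline_max": 10},
]
labels = [label_point(p, rules, outside_baseline) for p in points]
```

Checking the rules first keeps obvious failures from depending on the model at all, which matches the order of the steps in the text.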

Preferably, the step S5 specifically includes:

Step S501: polling all application clusters and judging whether the number of times a key performance indicator is abnormal within the specified abnormal time threshold exceeds the abnormal count threshold; if so, judging the health of the application cluster to be abnormal and entering S502; otherwise, judging the application cluster to be healthy;

Step S502: generating and sending alarm information.

Preferably, the method further comprises step S6: the front-end display unit displays the dynamic baseline and performance indicator data using ECharts charts.

The step S6 specifically comprises the following steps:

Step S601: the user selects the service system and application cluster to be viewed;

Step S602: the user selects the indicators and time period to be displayed;

Step S603: the front end queries the back-end database according to the user's input, organizes the data, and displays the dynamic baseline and abnormal-point data in an ECharts line chart.
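The "organizes the data" part of step S603 can be illustrated by a back-end helper that turns query results into an ECharts option object. This is a sketch under assumptions: the function name, the series names and the sample values are hypothetical, and the text does not say where the option is assembled. Only standard ECharts option keys (`xAxis`, `yAxis`, `series` with `type` set to `line` or `scatter`) are used.

```python
import json


def build_line_chart_option(times, values, lower, upper, labels):
    """Assemble an ECharts option: baseline band, observed values, anomalies."""
    # Points labelled -1 (abnormal) are drawn as a separate scatter series.
    anomalies = [[t, v] for t, v, lab in zip(times, values, labels) if lab == -1]
    return {
        "tooltip": {"trigger": "axis"},
        "legend": {"data": ["baseline max", "baseline min", "observed", "anomalies"]},
        "xAxis": {"type": "category", "data": times},
        "yAxis": {"type": "value"},
        "series": [
            {"name": "baseline max", "type": "line", "data": upper},
            {"name": "baseline min", "type": "line", "data": lower},
            {"name": "observed", "type": "line", "data": values},
            {"name": "anomalies", "type": "scatter", "data": anomalies},
        ],
    }


# Hypothetical query result for one indicator over three slices.
option = build_line_chart_option(
    times=["09:00", "09:05", "09:10"],
    values=[12, 13, 40],
    lower=[6, 7, 8],
    upper=[18, 19, 20],
    labels=[1, 1, -1],
)
payload = json.dumps(option)  # JSON string handed to the front end
```

On the page, the front end would pass the decoded object to the chart instance's `setOption` method.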

Compared with the prior art, the invention has the following advantages:

(1) The application cluster health degree detection system and method are based on the data acquisition unit, the dynamic baseline calculation unit, the anomaly detection unit, the alarm unit, the task scheduling unit and the local database, can effectively store APM data of the application clusters, realize scientific detection of the application cluster health degree, effectively improve the accuracy and the high efficiency of the application cluster health degree detection, normalize the health degree detection flow, and improve the reliability of the application cluster health degree detection;

(2) The abnormality detection unit introduces an abnormality detection algorithm, helps to establish alarm settings of various performance indexes without manual intervention, and effectively improves the accuracy and effect of abnormality detection;

(3) The invention provides time-series-based dynamic baseline establishment, which processes the historical time-series performance data with a 2-sigma method to establish a dynamic baseline, so that current real-time data can be judged against the historical time-series performance data, improving the reliability of the health judgment;

(4) The abnormality detection unit allows the user to define alarm triggering rules by adjusting the abnormal time threshold and the abnormal count threshold, thereby greatly improving the accuracy and precision of alarms and reducing the number of false alarms;

(5) The invention uses the front-end display unit to selectively display the dynamic baseline and abnormal-point data, thereby improving the operability and applicability of the system.

Drawings

FIG. 1 is a flow chart of the present invention;

FIG. 2 is a flow chart of step S2 of the present invention;

FIG. 3 is a flow chart of step S3 of the present invention;

FIG. 4 is a flow chart of step S4 of the present invention.

Detailed Description

The invention will now be described in detail with reference to the drawings and specific examples. Note that the following description of the embodiments is merely an example, and the present invention is not intended to be limited to the applications and uses thereof, and is not intended to be limited to the following embodiments.

Examples

A system for detecting the health of an application cluster comprises a data acquisition unit, a dynamic baseline calculation unit, an anomaly detection unit, an alarm unit, a task scheduling unit and a local database.

Specifically, the data acquisition unit is used for acquiring performance data generated by the APM tool and storing the performance data in the local database. In this embodiment, the collection frequency and the sampling range of the performance data are set, and the data collection unit collects the historical performance data generated by the APM tool according to the formulated collection frequency and sampling range, and finally stores the historical performance data in the local database.

The dynamic baseline calculation unit is used for generating a dynamic baseline from the sampled historical time-series performance data in the local database, and applies the 2-sigma principle to compensate for abnormal data so as to improve the accuracy of the baseline.

The abnormality detection unit is used for performing abnormality detection on the real-time performance data in the local database and attaching normal or abnormal labels to the real-time data. In this embodiment, the isolation forest algorithm is used to detect anomalies in the sampled real-time data, producing an important indicator for judging whether the application cluster is currently healthy.

The alarm unit is used for judging the health of the application cluster and sending alarm information accordingly. It checks the health of the application cluster at the specified health-check frequency and according to the specified rules, collects key information about the cluster's abnormal periods, and generates alarm information to notify the user.

The local database is used to store performance data.

The system also comprises a front-end display unit, wherein the front-end display unit is used for displaying the dynamic baseline and the health degree of each application cluster index in real time. In addition, the system also comprises a task scheduling unit which is used for uniformly managing the operations such as data acquisition, dynamic baseline calculation, alarm polling check, data archiving and the like and is used as a daemon to ensure the normal work of all calculation modules.
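The role of the task scheduling unit can be pictured with a small interval-based dispatcher. This is only a sketch: the text does not describe how scheduling is implemented, and the task names, the intervals and the `run_due` helper are hypothetical. A failing task is caught and logged so that one broken job does not stop the others, which is one simple way to read the "daemon" role described above.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    name: str
    interval_s: float              # how often the task should run, in seconds
    action: Callable[[], None]
    next_run: float = 0.0          # time at which the task is next due


def run_due(tasks: List[Task], now: float) -> List[str]:
    """Run every task whose next_run has passed; return the names attempted."""
    attempted = []
    for task in tasks:
        if now < task.next_run:
            continue
        try:
            task.action()
        except Exception as exc:   # keep the scheduler alive if a task fails
            print(f"task {task.name!r} failed: {exc}")
        task.next_run = now + task.interval_s
        attempted.append(task.name)
    return attempted


calls = []
tasks = [
    Task("acquire", 300, lambda: calls.append("acquire")),        # hypothetical: every 5 min
    Task("baseline", 86400, lambda: calls.append("baseline")),    # hypothetical: daily
    Task("alarm_poll", 60, lambda: calls.append("alarm_poll")),   # hypothetical: every minute
]
first = run_due(tasks, now=0)     # everything is due at start-up
second = run_due(tasks, now=60)   # only the alarm poll is due
third = run_due(tasks, now=300)   # acquisition and alarm poll are due
```

A long-running process would call `run_due(tasks, time.monotonic())` in a loop with a short `time.sleep` between iterations.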

S1: initializing the application cluster information to be monitored.

The step S1 specifically comprises the following steps:

S101: synchronizing the application cluster information of the accessed APM tool;

In this embodiment, whether to establish a dynamic baseline of the application cluster information is determined according to a preset dynamic baseline establishment determination rule.

S2: the data acquisition unit acquires performance data generated by the APM tool and stores the performance data in the local database.

As shown in fig. 2, step S2 specifically includes:

S201: acquiring an APM data source;

S202: organizing key performance indicator data according to the data slices;

In this embodiment, slice statistics is performed on APM data sources with granularity of five minutes, and key performance index data obtained according to data slicing includes a total response number, a slow number, an error number, a slow rate, an error rate, and a requested URL.
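The five-minute slice statistics can be sketched as follows, using only the standard library. The record fields (`time`, `url`, `elapsed_ms`, `error`) and the 1000 ms slow threshold are assumptions made for illustration; the text does not specify the APM record layout or how a slow transaction is defined.

```python
from collections import defaultdict
from datetime import datetime


def slice_start(ts):
    """Round a timestamp down to the start of its five-minute slice."""
    return ts.replace(minute=ts.minute - ts.minute % 5, second=0, microsecond=0)


def slice_kpis(records, slow_ms=1000):
    """Aggregate raw call records into per-slice, per-URL KPI rows."""
    buckets = defaultdict(lambda: {"total": 0, "slow": 0, "error": 0})
    for r in records:
        b = buckets[(slice_start(r["time"]), r["url"])]
        b["total"] += 1
        b["slow"] += int(r["elapsed_ms"] > slow_ms)   # assumed slow criterion
        b["error"] += int(bool(r["error"]))

    rows = []
    for (start, url), b in sorted(buckets.items()):
        rows.append({
            "slice_start": start,
            "url": url,
            "total": b["total"],
            "slow": b["slow"],
            "error": b["error"],
            "slow_rate": b["slow"] / b["total"],
            "error_rate": b["error"] / b["total"],
        })
    return rows


# Hypothetical call records from the APM data source.
records = [
    {"time": datetime(2020, 12, 30, 9, 1), "url": "/a", "elapsed_ms": 200, "error": False},
    {"time": datetime(2020, 12, 30, 9, 3), "url": "/a", "elapsed_ms": 1500, "error": False},
    {"time": datetime(2020, 12, 30, 9, 4), "url": "/a", "elapsed_ms": 300, "error": True},
    {"time": datetime(2020, 12, 30, 9, 7), "url": "/a", "elapsed_ms": 100, "error": False},
]
rows = slice_kpis(records)
```

Each resulting row corresponds to one record that would be stored in the local relational database.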

S3: the baseline calculation unit obtains a dynamic baseline according to historical time sequence performance data in the local database.

As shown in fig. 3, in step S3, the baseline calculating unit calculates a dynamic baseline according to the historical performance data of the application cluster, where the dynamic baseline is used for showing the response situation of the transaction in each period, and predicting the transaction situation at the corresponding time point of the next day, and specifically includes:

S304: calculating the maximum value, the minimum value and the variance of the historical time-series performance data at the same time point of each day, and generating a dynamic baseline from them;

S305: writing the dynamic baseline to the database.

In step S304, the dynamic baseline maximum is generated by adding twice the variance to the mean of the historical time-series performance data, and the dynamic baseline minimum is generated by subtracting twice the variance from the mean:

$$x_{max} = \bar{x} + 2\sigma^2, \qquad x_{min} = \bar{x} - 2\sigma^2$$

where $x_{max}$ is the dynamic baseline maximum, $x_{min}$ is the dynamic baseline minimum, $\bar{x}$ is the mean, and $\sigma^2$ is the variance.
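As a worked illustration of the rule in step S304, assume a hypothetical time point whose historical mean is $\bar{x} = 120$ and whose variance is $\sigma^2 = 16$ (illustrative numbers, not taken from the embodiment). Adding and subtracting twice the variance gives:

$$x_{max} = 120 + 2 \times 16 = 152, \qquad x_{min} = 120 - 2 \times 16 = 88$$

A real-time value of 150 at that time point would therefore fall inside the baseline, while a value of 160 would fall outside it. For comparison, if the multiplier were applied to the standard deviation ($\sigma = 4$) instead, the band would be $[112, 128]$; the text as written specifies the variance.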

S4: the abnormality detection unit performs abnormality detection on the real-time performance data in the local database.

As shown in fig. 4, S4 specifically includes:

S401: acquiring the real-time performance data in the local database;

In this embodiment, the label values corresponding to the normal and abnormal labels are defined as {normal: 1, abnormal: -1}, an abnormal rule is defined, and in S403 the isolation forest algorithm is used to judge whether the real-time performance data is abnormal.

The step S5 specifically comprises the following steps:

step S502: and generating and sending alarm information.

In this embodiment, S5 includes:

S501: polling all application clusters and checking whether any indicator has received an abnormal label more than 2 times within the last 30 minutes; if so, entering S502;

S502: composing the alarm information and notifying the user through channels such as e-mail and SMS.
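The polling check in S501 can be sketched as below, with the 30-minute window and the threshold of 2 from this embodiment as defaults. The cluster names, the sample labels and the function name `is_unhealthy` are hypothetical, and the notification step of S502 is left out because the text does not describe the mail or SMS interface.

```python
from datetime import datetime, timedelta

ABNORMAL = -1


def is_unhealthy(labelled, now, window=timedelta(minutes=30), max_anomalies=2):
    """Return True when abnormal labels within the window exceed the threshold.

    `labelled` is an iterable of (timestamp, label) pairs for one indicator.
    """
    cutoff = now - window
    count = sum(
        1 for ts, label in labelled
        if cutoff < ts <= now and label == ABNORMAL
    )
    return count > max_anomalies   # "more than 2 times" in the embodiment


now = datetime(2020, 12, 30, 10, 0)
clusters = {
    "orders": [   # three abnormal slices inside the last 30 minutes
        (datetime(2020, 12, 30, 9, 35), -1),
        (datetime(2020, 12, 30, 9, 40), 1),
        (datetime(2020, 12, 30, 9, 45), -1),
        (datetime(2020, 12, 30, 9, 50), -1),
        (datetime(2020, 12, 30, 9, 55), 1),
    ],
    "billing": [  # four abnormal slices, but only two inside the window
        (datetime(2020, 12, 30, 9, 20), -1),
        (datetime(2020, 12, 30, 9, 25), -1),
        (datetime(2020, 12, 30, 9, 45), -1),
        (datetime(2020, 12, 30, 9, 50), -1),
    ],
}
# Poll every cluster; those returned here would proceed to S502.
unhealthy = [name for name, series in clusters.items() if is_unhealthy(series, now)]
```

Requiring several abnormal labels inside a window, rather than alarming on a single one, is what lets the thresholds trade sensitivity against false alarms.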

In this embodiment, the system of the present invention further includes a front-end display unit, where the front-end display unit is configured to display, in real time, a dynamic baseline and a health degree of each application cluster index.

Correspondingly, the method of the invention further comprises step S6: the front-end display unit displays the dynamic baseline and performance indicator data using ECharts charts.

The step S6 specifically comprises the following steps:

The above embodiments are merely examples, and do not limit the scope of the present invention. These embodiments may be implemented in various other ways, and various omissions, substitutions, and changes may be made without departing from the scope of the technical idea of the present invention.

Claims

1. A system for detecting the health degree of an application cluster is characterized by comprising a data acquisition unit, a dynamic baseline calculation unit, an abnormality detection unit, an alarm unit, a task scheduling unit and a local database,

the alarm unit is used for judging the health degree of the application cluster according to the abnormal detection result of the abnormal detection unit and sending out alarm information,

the local database is used to store performance data,

wherein the generation of the dynamic baseline comprises the following steps:

s305: the dynamic baseline is written to a database,

in step S304, a dynamic baseline maximum value is generated by adding 2 times of variance to the average value of the historical time series performance data, and a dynamic baseline minimum value is generated by subtracting 2 times of variance from the average value of the historical time series performance data.

2. The system for detecting health of application clusters according to claim 1, further comprising a front-end display unit, wherein the front-end display unit is configured to display the dynamic baseline and health of each application cluster indicator in real time.

3. A method for detecting the health of an application cluster, characterized in that the method is based on the system for detecting the health of an application cluster according to claim 1 and comprises the following steps:

s1: initializing application cluster information to be monitored;

s5: the alarm unit judges the health degree of the application cluster according to the abnormal detection result of the abnormal detection unit and sends out alarm information,

the step S3 specifically includes:

s305: the dynamic baseline is written to a database,

4. A method for detecting health of an application cluster according to claim 3, wherein said step S1 specifically comprises:

s101: synchronizing application cluster information of the accessed APM tool;

5. A method for detecting health of an application cluster according to claim 3, wherein said step S2 specifically comprises:

s201: acquiring an APM data source;

S202: organizing key performance indicator data according to the data slices;

6. A method for detecting health of an application cluster according to claim 3, wherein said step S4 specifically comprises:

s401: acquiring real-time performance data in a local database;

7. A method for detecting health of an application cluster according to claim 3, wherein said step S5 specifically comprises:

step S502: and generating and sending alarm information.

8. A method for detecting health of application clusters according to claim 3, wherein the system further comprises a front-end display unit, the front-end display unit is configured to display the dynamic baseline and health of each application cluster indicator in real time, and the method further comprises step S6: the front-end display unit displays the dynamic baseline and performance indicator data using ECharts charts.