CN108599977B

CN108599977B - System and method for monitoring system availability based on statistical method

Info

Publication number: CN108599977B
Application number: CN201810150782.5A
Authority: CN
Inventors: 梅存兵
Original assignee: Nanjing Tuniu Technology Co ltd
Current assignee: Nanjing Tuniu Technology Co ltd
Priority date: 2018-02-13
Filing date: 2018-02-13
Publication date: 2021-09-28
Anticipated expiration: 2038-02-13
Also published as: CN108599977A

Abstract

The invention provides a system and a method for monitoring system availability based on a statistical method. The system comprises an inter-system service call log module, an alarm threshold analysis module, an alarm analysis module and a monitoring alarm display module. By collecting call logs between systems, historical data is periodically analyzed and learned to obtain the typical behavior of each system; the data in the latest unit time t is analyzed to determine whether the current error count of each system is abnormal, whether the error rate of inter-system calls is abnormal, and whether the availability of each instance of each service of the system is abnormal; and abnormal systems and abnormal inter-system call relations are marked as alarms on the system topology graph. When alarm information is displayed, the system state, the state of inter-system calls, the state of system services and the state of instances are shown on the system topology graph, so that the faulty system can be located quickly when many systems have problems at once.

Description

System and method for monitoring system availability based on statistical method

Technical Field

The invention belongs to the technical field of software system monitoring, and relates to a system and a method for monitoring system availability based on a statistical method.

Background

Internet enterprises generally run a large number of application systems: besides externally open websites, apps and the like, many application systems are also provided internally to support the operation and management of the enterprise. There is usually a relatively complex calling relationship between internal application systems, and a function that one system provides for another system to call is referred to as a service. For availability monitoring of application systems, the industry generally takes the following measures:

Method 1: using Zabbix or similar tools to monitor certain metrics of the system server, such as the number of Web system processes/threads, CPU load, available memory, the number of abnormal HTTP status codes, request response time and the like. An alarm is raised when a metric exceeds its set threshold.

Method 2: a simulated client calls the system periodically and checks whether indicators such as the content and speed of the server-side response meet the set thresholds. An alarm is raised when an indicator exceeds its set threshold.

However, the existing monitoring mode has various defects:

1. The thresholds in Method 1 and Method 2 must be set manually. Thresholds differ between systems and, for the same system, differ completely between periods, so setting and maintaining them is a large workload. In practice a trial-and-error approach is generally adopted: the threshold is widened after a false alarm and tightened after a missed alarm, so both the false alarm rate and the missed alarm rate are high.

2. The monitoring in Method 1 reflects availability only partially and cannot serve as an actual availability index: a detected abnormality does not necessarily mean that system availability has dropped, and an unavailable system does not necessarily show up in the monitored indicators.

3. The monitoring in Method 2 reflects availability directly, but as a spot-check technique it has few samples and narrow coverage; it can generally monitor only read operations and is seldom used for write operations.

4. As the system becomes more and more complex, both monitoring methods involve too many indicators, too many alarms and a high level of alarm noise, which hinders the diagnosis and localization of problems.

5. When a new system or a new service goes online, or when system and service deployment changes, both monitoring methods require monitoring items to be maintained manually, and are therefore unsuitable for systems with automatic failover and dynamic service scaling.

6. For error-rate alarms, a fixed threshold often causes false alarms. For example, when the error rate is required not to exceed 1%, an alarm is raised if only one operation occurs and it fails (the error rate is 100%), although in most cases no alarm is needed.

7. When several systems in a complex system cluster fail at the same time, it is difficult to quickly locate the system that is actually at fault; everything has to be investigated indiscriminately, and precious time is wasted.

Disclosure of Invention

In order to solve the problems, the invention provides a system and a method for monitoring system availability based on a statistical method, which periodically analyze and learn historical data by acquiring call logs among systems to obtain the general performance of each system; analyzing the data in the latest unit time t, and distinguishing whether the current error number of each system is abnormal, whether the error rate of inter-system calling is abnormal, and whether the availability of each instance of each service of the system is abnormal; and marking abnormal systems and abnormal inter-system call relations on the system topological graph in an alarm mode.

In order to achieve the purpose, the invention provides the following technical scheme:

a system for monitoring system availability based on statistical methods, comprising: the inter-system service call log module, the alarm threshold analysis module, the alarm analysis module and the monitoring alarm display module;

the inter-system service call log module is used for collecting and recording log information of all calls between systems, including the call time, the IP and port number of the caller, the IP and port number of the callee, the identifier of the called service, and whether the call succeeded;

the alarm threshold analysis module is used for regularly learning historical data, finding out the performance of each system under the general situation, obtaining n samples of the current system, wherein each sample describes the error number in unit time t, and removing abnormal points in a sample set, and the process of removing the abnormal points comprises the following steps:

i. calculating the mean u and the standard deviation std of the current sample set;

ii. finding all sample points greater than u + 3 × std in the sample set, counting them as n1, removing them from the sample set, and counting the size n2 of the new sample set after their removal;

iii. if the stopping condition is met, finishing the removal of abnormal points and continuing with the following steps; otherwise, returning to step i;

calculating the error-count alarm line of the system as alertNum = u + 3 × std;

the alarm analysis module is used for regularly acquiring logs in the latest t time period, successively analyzing whether the error number of each system is abnormal, whether each service error rate of each system is abnormal, whether the instance error rate is abnormal and the error number condition between any two systems, and specifically judging the following conditions after traversing the system list:

a) if the accumulated error number of the system is larger than the alarm threshold value of the system, the system is marked to be abnormal;

b) traversing each service of the system, applying an abnormal judgment method to judge whether the error rate is abnormal or not;

c) traversing each instance of the system, and applying an abnormality judgment method to judge whether the error rate of the system is abnormal or not;

the abnormality determination method includes:

a) recording the correct times as tNum, recording the wrong times as fNum, and recording the total calling times num as tNum + fNum;

b) if fNum < the first threshold, return normal; otherwise, go to the next step;

c) if num >= the second threshold, go to the next step; otherwise, return normal if fNum = the first threshold, and return abnormal if not;

d) if fNum/num < the third threshold, return normal; otherwise, go to the next step;

e) when fNum < tNum, k = fNum + the fourth threshold; otherwise k = fNum - the fifth threshold;

f) computing

$$z = \frac{k - num \cdot p_0}{\sqrt{num \cdot p_0 \cdot (1 - p_0)}}$$

where p_0 denotes the third threshold (the highest acceptable error rate); if z > the sixth threshold, return abnormal; otherwise return normal;

after organizing the data, calculating the error count for each pair of a clientInstance calling a server instance;

looking up, from the system topology graph, the client system and the server system to which each client instance and server instance belong, and counting the accumulated error count for each pair of a client system calling a server system;

and the alarm display module is used for displaying the alarm data on the system topological graph after the alarm data is analyzed based on the system topological graph.

Further, the alarm threshold analysis module is further configured to set an alarm threshold; after the error-count alarm line of the system is calculated, if alertNum < the alarm threshold, alertNum is set to the alarm threshold.

Further, the conditions during the process of removing the abnormal point are as follows:

n1 = 0, or (n - n2) > 30, or (n - n2) > n/3.

Further, the alarm display module is further configured to:

1. when the system is abnormal, adding a warning mark on a system icon:

2. when the service and the instance of the system are abnormal, clicking a system icon, and displaying error information by a popup layer;

3. and when the number of the intersystem calling errors is not 0, drawing a connecting line and a directional arrow.

Further, the width of the connecting line is related to the error number.

The method for monitoring the availability of the system based on the statistical method comprises the following steps:

step one, acquiring and recording log information of all calls between systems, wherein the log information comprises the call time, the IP and port number of the caller, the IP and port number of the callee, the identifier of the called service, and whether the call succeeded;

step two, regularly learning the historical data, finding out the performance of each system under the general situation, obtaining n samples of the current system, wherein each sample describes the error number in unit time t, and removing abnormal points in a sample set, and the process of removing the abnormal points comprises the following steps:

i. calculating the mean u and the standard deviation std of the current sample set;

calculating the error-count alarm line of the system as alertNum = u + 3 × std;

step three, regularly collecting logs in the latest time period t, successively analyzing whether the error number of each system is abnormal, whether each service error rate of each system is abnormal, whether the instance error rate is abnormal and the error number condition between any two systems, and specifically judging the following conditions after traversing the system list:

the abnormality determination method includes:

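f) computing

$$z = \frac{k - num \cdot p_0}{\sqrt{num \cdot p_0 \cdot (1 - p_0)}}$$

where k is the corrected error count, num is the total number of calls and p_0 denotes the third threshold; if z > the sixth threshold, return abnormal; otherwise return normal;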

and step four, based on the system topological graph, displaying the alarm data on the system topological graph after the alarm data is analyzed.

Further, the first step further comprises:

an alarm threshold is set, and after the error-count alarm line of the system is calculated, if alertNum < the alarm threshold, alertNum is set to the alarm threshold.

Further, the conditions in the process of removing the abnormal point in the second step are as follows:

n1 = 0, or (n - n2) > 30, or (n - n2) > n/3.

Further, the fourth step further comprises the following steps:

1. when the system is abnormal, adding a warning mark on a system icon:

Further, the width of the connecting line is related to the error number.

Compared with the prior art, the invention has the following advantages and beneficial effects:

1. By analyzing the service call logs between instances, the invention can monitor whether systems, system services and system instances are abnormal, and displays the alarm information in combination with the system topology graph. When alarm information is displayed, the system state, the state of inter-system calls, the state of system services and the state of instances are shown on the system topology graph, so that the faulty system can be located quickly when many systems have problems at once.

2. The invention obtains the alarm threshold value by analyzing the normal performance of the system in the past period of time; when the alarm analysis is carried out, the alarm is carried out when the error number exceeds the threshold value; for the quantity type alarm, an automatic setting method of an alarm threshold value is provided, the labor is reduced, the alarm accuracy is improved, and the situations of false alarm and missed alarm are greatly reduced. After the new system is operated on line for a period of time, the invention can automatically set an alarm threshold value for the new system.

3. The invention can analyze whether the services and instances of a system are abnormal; for ratio-type alarms, it improves alarm accuracy and reduces false alarms and missed alarms.

4. The monitoring method uses actual call data as its samples, so its coverage is more comprehensive than that of periodic spot checks.

Drawings

FIG. 1 is a schematic diagram of a normal distribution.

Fig. 2 is an exemplary diagram of a log format.

FIG. 3 is a flow chart of alarm threshold analysis.

Fig. 4 is a diagram showing an error count of a sample in the system obtained by calling the Logstash interface.

FIG. 5 is a diagram of inter-instance call data.

FIG. 6 is a system topology diagram with warning flags added.

FIG. 7 is a system topology diagram showing error information in a popup layer.

Detailed Description

The technical solutions provided by the present invention will be described in detail below with reference to specific examples, and it should be understood that the following specific embodiments are only illustrative of the present invention and are not intended to limit the scope of the present invention. Additionally, the steps illustrated in the flow charts of the figures may be performed in a computer system such as a set of computer-executable instructions and, although a logical order is illustrated in the flow charts, in some cases, the steps illustrated or described may be performed in an order different than here.

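The error count of a system in unit time t can be regarded as being influenced by many independent random factors, each of which generally has a small effect, so the error count can be studied as a random variable that obeys a normal distribution. The density function of a normal distribution is:

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

where \mu is the mean (written u in the following) and \sigma is the standard deviation (written std in the following).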

By collecting data on the typical behavior of the system over a past period of time, the mean u and the standard deviation std of the error count per unit time t can be calculated. Let the number of errors in the latest unit time t be denoted failNum. As shown in fig. 1, it is easy to calculate that the probability P(failNum >= u + 3 × std) is much less than 0.01, i.e. it is an extremely small probability event. Therefore, when the number of errors observed in the latest unit time t exceeds the mean plus three standard deviations, this is an extreme condition that requires manual attention, and alarm information should be sent.
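
The size of this tail probability can be checked directly. The following is a minimal sketch, assuming the error count is exactly normally distributed; the function name `normal_upper_tail` is illustrative and not part of the invention.

```python
import math


def normal_upper_tail(k: float) -> float:
    """Return P(X >= u + k * std) for a normally distributed X.

    Uses the complementary error function: P(Z >= k) = 0.5 * erfc(k / sqrt(2)).
    """
    return 0.5 * math.erfc(k / math.sqrt(2.0))


if __name__ == "__main__":
    # Probability of exceeding the mean by three standard deviations.
    print(f"P(failNum >= u + 3*std) = {normal_upper_tail(3.0):.5f}")
```

Under this assumption the probability is roughly 0.00135, well below the 0.01 mentioned in the text.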

When studying the error rate of systems and services by observing error counts, it is easy to see that even if the highest acceptable error rate p0 is 0.01, observing more than 1 failure in 100 calls does not by itself indicate a problem in the system, because this is an event with a fairly high probability of occurrence.

When the number of calls is small (here, fewer than 40), we calculate the conditional probability p1 that, although the error rate p of the system is not higher than p0, the number of errors failNum observed in n calls is greater than failLevel (evaluated at the boundary case p = p0):

$$p_1 = P(failNum > failLevel) = 1 - \sum_{i=0}^{failLevel} \binom{n}{i} p_0^{i} (1 - p_0)^{n-i}$$

We refer to events with an occurrence probability below 0.05 as small probability events. In a small number of trials, a small probability event should not occur; that is, when such an event does occur, we can no longer assume that p is not higher than p0 and should instead conclude that p is higher than p0, meaning the system error rate is too high and an alarm should be sent. By numerical computation, we find the critical value of failLevel that makes p1 < 0.05 for each n: failLevel is 0 when n <= 5, 1 when 5 < n <= 35, and 2 when 35 < n < 40. That is, after observing n calls, if the error count is higher than the corresponding failLevel, a small probability event is considered to have occurred and attention is needed; if not, the system is considered normal. For convenience, we uniformly set failLevel to 1 for n < 40; the resulting error is within an acceptable range in practice.
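
The critical values of failLevel can be reproduced numerically. The sketch below assumes that failNum follows a binomial distribution with error rate exactly p0 = 0.01 and searches for the smallest failLevel whose exceedance probability falls below 0.05; the function names are illustrative only.

```python
from math import comb


def tail_prob(n: int, fail_level: int, p0: float = 0.01) -> float:
    """P(failNum > fail_level) in n calls when the true error rate is p0."""
    cdf = sum(
        comb(n, i) * p0**i * (1 - p0) ** (n - i)
        for i in range(fail_level + 1)
    )
    return 1.0 - cdf


def critical_fail_level(n: int, p0: float = 0.01, alpha: float = 0.05) -> int:
    """Smallest fail_level for which exceeding it is a small probability event."""
    level = 0
    while tail_prob(n, level, p0) >= alpha:
        level += 1
    return level


if __name__ == "__main__":
    for n in (1, 5, 6, 20, 35, 36, 39):
        print(n, critical_fail_level(n))
```

For example, `critical_fail_level(20)` evaluates to 1: a single failure in 20 calls is tolerated, while two or more failures count as a small probability event.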

When the number of calls is large (we take this to mean no fewer than 40), let p1 be the observed error rate and p the true error rate of the system, which is normally no higher than p0. From the central limit theorem, p1 approximately follows a normal distribution with mean p and variance p(1-p)/n, i.e., the statistic

$$\frac{p_1 - p}{\sqrt{p(1-p)/n}}$$

obeys a standard normal distribution. When p <= p0,

$$\frac{p_1 - p_0}{\sqrt{p_0(1-p_0)/n}}$$

approximately obeys a standard normal distribution; according to the standard normal quantile table, when

$$\frac{p_1 - p_0}{\sqrt{p_0(1-p_0)/n}} > 1.645$$

the probability is lower than 0.05, so the event is a small probability event; attention should be paid and an alarm should be sent. For convenient application, we transform

$$\frac{p_1 - p_0}{\sqrt{p_0(1-p_0)/n}}$$

into:

$$\frac{n p_1 - n p_0}{\sqrt{n p_0 (1-p_0)}}$$

where n × p1 is the number of errors actually observed.

The corresponding abnormality determination method comprises the following steps:

a) record the number of successful calls as tNum, the number of failed calls as fNum, and the total number of calls num = tNum + fNum;

b) if fNum < 1, return normal; otherwise, go to the next step;

c) if num >= 40, go to the next step; otherwise, return normal if fNum = 1, and return abnormal if not;

d) if fNum/num < 0.01, return normal; otherwise, go to the next step;

e) when fNum < tNum, k = fNum + 0.5; otherwise k = fNum - 0.5 (since the distribution is only approximately normal, this correction brings the statistic closer to a normal distribution);

f) computing

$$z = \frac{k - num \times 0.01}{\sqrt{num \times 0.01 \times (1 - 0.01)}}$$

if z > 1.645, return abnormal; otherwise return normal.

Each data in the abnormality determination method can be adjusted as needed.
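
The determination steps above can be sketched as a single function. This is a minimal illustration rather than the patented implementation: the parameter names are hypothetical stand-ins for the six adjustable thresholds, with defaults taken from the values in this embodiment (1, 40, 0.01, 0.5, 0.5 and 1.645), and the correction in step e) is applied exactly as the text states it.

```python
import math


def is_abnormal(
    t_num: int,
    f_num: int,
    min_errors: int = 1,      # first threshold
    min_calls: int = 40,      # second threshold
    p0: float = 0.01,         # third threshold: highest acceptable error rate
    add_corr: float = 0.5,    # fourth threshold
    sub_corr: float = 0.5,    # fifth threshold
    z_crit: float = 1.645,    # sixth threshold
) -> bool:
    """Return True if the observed error rate is judged abnormal."""
    num = t_num + f_num                       # a) total number of calls
    if f_num < min_errors:                    # b) no errors at all
        return False
    if num < min_calls:                       # c) small-sample branch
        return f_num != min_errors            #    exactly min_errors is tolerated
    if f_num / num < p0:                      # d) rate below acceptable level
        return False
    # e) correction of the error count, as described in the text
    k = f_num + add_corr if f_num < t_num else f_num - sub_corr
    # f) z statistic compared against the one-sided 5% quantile
    z = (k - num * p0) / math.sqrt(num * p0 * (1 - p0))
    return z > z_crit


if __name__ == "__main__":
    print(is_abnormal(t_num=99, f_num=1))   # 1 failure in 100 calls
    print(is_abnormal(t_num=96, f_num=4))   # 4 failures in 100 calls
```

With the default values, one failure in 100 calls gives z of about 0.50 and is treated as normal, while four failures in 100 calls gives z of about 3.52 and is treated as abnormal.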

The invention provides a system for monitoring system availability based on a statistical method, comprising: an inter-system service call log module, an alarm threshold analysis module, an alarm analysis module and a monitoring alarm display module. The service list and the instance list of each system can be obtained directly from the system topology graph. The service list, the instance list, and the logs related to services and instances are described in more detail in the invention patent application No. 2017109039551 filed by the same department, entitled "system deployment and dependency relationship automatic drawing system and method".

The inter-system service call log module collects and records log information for all calls between systems. Specifically, we refer to a specific deployment of a system (Application) on a server as an instance (Instance), which is uniquely identified by the IP of the server where it is located and the port number it occupies. After one instance calls a service of another instance, the caller records a call log containing: call time (startTime), caller IP (consumerIp) and port number (consumerPort), callee IP (serviceIp) and port number (servicePort), called service identifier (serviceName), and whether the call succeeded (success). The module stores the logs using the open source tool Logstash and can save the data within 2 seconds after the call completes. The stored log format is shown in fig. 2.
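
As an illustration of the record described above, one call log entry could be modelled as follows. This is a sketch under assumptions: the field names follow the identifiers in the text, while the class name `CallLog`, the field types and the `ip:port` string used as the instance key are hypothetical choices, not the stored format of fig. 2.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class CallLog:
    """One inter-instance service call, as recorded by the caller."""

    startTime: datetime   # call time
    consumerIp: str       # caller IP
    consumerPort: int     # caller port number
    serviceIp: str        # callee IP
    servicePort: int      # callee port number
    serviceName: str      # identifier of the called service
    success: bool         # whether the call succeeded

    @property
    def client_instance(self) -> str:
        # An instance is identified by server IP plus occupied port.
        return f"{self.consumerIp}:{self.consumerPort}"

    @property
    def server_instance(self) -> str:
        return f"{self.serviceIp}:{self.servicePort}"


if __name__ == "__main__":
    entry = CallLog(
        datetime(2018, 2, 13, 12, 0, 0),
        "10.0.0.1", 8080, "10.0.0.2", 9090, "order.query", True,
    )
    print(entry.client_instance, "->", entry.server_instance)
```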

The alarm threshold analysis module learns the historical data periodically to find out the performance of each system under the general situation, and the specific implementation process is shown in fig. 3 and comprises the following steps:

1. traversing the system list:

a) obtaining all service lists of the current system;

b) calling a Logstash interface, taking the accumulated error number of all services of the system in the latest n × t time range, and dividing the accumulated error number into n parts by taking t as a unit, namely, obtaining n samples of the current system, wherein each sample describes the error number in the unit time t, and is shown in fig. 4;

c) removing abnormal points in the sample set:

i. calculating the mean u and the standard deviation std of the current sample set;

ii. finding all sample points greater than u + 3 × std in the sample set, counting them as n1, removing them from the sample set, and counting the size n2 of the new sample set; this step finds and removes historical abnormal conditions so that they do not distort the estimate of the system's typical behavior;

iii. if n1 = 0 or (n - n2) > 30 or (n - n2) > n/3, the removal of abnormal points is complete and step d) follows; otherwise, return to step i;

d) calculating the mean u and the standard deviation std of the new sample set;

e) calculating the error-count alarm line of the system as alertNum = u + 3 × std; if alertNum < 100, alertNum is set to 100.
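
The threshold learning procedure for one system can be sketched as below. It is a minimal illustration under assumptions: the text does not say whether the population or the sample standard deviation is used, so the population form (`statistics.pstdev`) is assumed here, and the function and parameter names are hypothetical.

```python
import statistics
from typing import Sequence


def learn_alert_num(
    samples: Sequence[float],
    min_alert: float = 100.0,   # floor applied to the alarm line
    max_removed: int = 30,      # stop once more than this many points are removed
) -> float:
    """Learn the error-count alarm line from n per-interval error counts."""
    data = list(samples)
    n = len(data)
    while True:
        u = statistics.fmean(data)
        std = statistics.pstdev(data)          # assumption: population std
        kept = [x for x in data if x <= u + 3 * std]
        n1 = len(data) - len(kept)             # points removed in this pass
        data = kept
        n2 = len(data)                         # size of the new sample set
        if n1 == 0 or (n - n2) > max_removed or (n - n2) > n / 3:
            break
    u = statistics.fmean(data)
    std = statistics.pstdev(data)
    return max(u + 3 * std, min_alert)


if __name__ == "__main__":
    history = [10] * 50 + [12] * 49 + [1000]   # one historical incident
    print(learn_alert_num(history))
    print(learn_alert_num(history, min_alert=0.0))
```

In this example the single spike of 1000 is removed in the first pass, so the learned line reflects only the typical error counts; the floor then raises it to 100.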

The alarm analysis module regularly collects logs (for example, the collection is performed once per minute, and the collection interval can be adjusted as required) in the latest t time period, and successively analyzes whether the error number of each system is abnormal, whether each service error rate of each system is abnormal, whether the instance error rate is abnormal, and the error number between any two systems, and the specific method is as follows:

1. extracting the logs of the latest period t from Logstash, i.e., enumerating, for every existing case of an instance a calling service c on an instance b, the number of successful and failed calls, as shown in fig. 5, and recording the result as data;

2. organizing the data, merging the field consumerIp and the field consumerPort into a field clientInstance, and merging the field serviceIp and the field servicePort into a field serviceInstance;

3. organizing the data, and calculating the accumulated success count and the accumulated error count of each server instance and of each serviceName;

4. traversing the system list:

a) counting the accumulated error number of each system, namely the sum of the error numbers of all services under the system;

b) if the accumulated error count of the system is larger than the alarm threshold alertNum of the system, marking the system as abnormal;

c) traversing each service (serviceName) of the system, applying the abnormity judgment method, and substituting the accumulative correct number and the accumulative error number of the serviceName to judge whether the serviceName is abnormal or not;

d) traversing each instance (server instance) of the system, applying the above abnormality determination method, and substituting the accumulated success count and the accumulated error count of the server instance to judge whether the instance is abnormal;

5. organizing the data, and calculating the error count for each pair of a clientInstance calling a server instance;

6. looking up, from the system topology graph, the client system and the server system to which each client instance and server instance belong, and counting the accumulated error count for each pair of a client system calling a server system.
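
The aggregation steps above can be sketched as follows. This is an illustration under assumptions: each row of data is taken to be a dictionary with the fields clientInstance, serviceInstance and serviceName plus two hypothetical count fields `ok` and `failed`, and the topology lookup is represented by a plain dictionary from instance to system name.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def summarize(rows: Iterable[Mapping], instance_to_system: Dict[str, str]):
    """Aggregate call counts per instance, per service, per system and per edge."""
    by_instance = defaultdict(lambda: [0, 0])   # serviceInstance -> [ok, failed]
    by_service = defaultdict(lambda: [0, 0])    # (system, serviceName) -> [ok, failed]
    system_errors = defaultdict(int)            # server system -> failed
    edge_errors = defaultdict(int)              # (client system, server system) -> failed

    for row in rows:
        # Reverse lookup of the owning systems from the topology mapping.
        client_sys = instance_to_system[row["clientInstance"]]
        server_sys = instance_to_system[row["serviceInstance"]]

        for counter in (
            by_instance[row["serviceInstance"]],
            by_service[(server_sys, row["serviceName"])],
        ):
            counter[0] += row["ok"]
            counter[1] += row["failed"]

        system_errors[server_sys] += row["failed"]
        edge_errors[(client_sys, server_sys)] += row["failed"]

    return by_instance, by_service, system_errors, edge_errors
```

The per-system totals would then be compared against each system's alarm line, the per-service and per-instance counts passed to the abnormality determination method, and the edge totals used when drawing the topology graph.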

And the alarm display module is based on the system topological graph and displays the alarm data on the system topological graph after the alarm data is analyzed.

1. When the system is abnormal, adding a warning mark on a system icon, as shown in FIG. 6;

2. when the service and the instance of the system are abnormal, clicking a system icon, and displaying error information by a popup layer, as shown in FIG. 7;

3. when the number of inter-system call errors is not 0, drawing a connecting line and a directional arrow, wherein the width of the line is the logarithm of the number of errors. Other conventional formulas for substituting the error number may be used to calculate the line width, as long as the line width or color is correlated to the error number to meet the requirements of the present invention.
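
A logarithmic width mapping of this kind might look like the sketch below. The text does not specify the base of the logarithm or a minimum width, so base 10 and a hypothetical `min_width` parameter are assumed here purely for illustration.

```python
import math


def line_width(error_count: int, min_width: float = 1.0) -> float:
    """Width of the connecting line between two systems.

    Returns 0.0 (no line drawn) when there are no errors; otherwise the
    base-10 logarithm of the error count, but never thinner than min_width.
    """
    if error_count <= 0:
        return 0.0
    return max(min_width, math.log10(error_count))


if __name__ == "__main__":
    for count in (0, 5, 100, 10000):
        print(count, line_width(count))
```

A logarithmic scale keeps lines readable when error counts between system pairs differ by several orders of magnitude.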

When the system fails, we can easily see from the figure that: which systems have problems, which systems are affected, which instances of systems and services have problems.

The invention also provides a method for monitoring the system availability based on the statistical method, which comprises the steps of service call logs among systems; analyzing an alarm threshold; alarm analysis; monitoring and alarming display; the inter-system service call log step executes contents realized by the inter-system service call log module, the alarm threshold analysis step executes contents realized by the alarm threshold analysis module, the alarm analysis step executes contents realized by the alarm analysis module, and the monitoring alarm display step executes contents realized by the monitoring alarm display module.

The technical means disclosed in the invention scheme are not limited to the technical means disclosed in the above embodiments, but also include the technical scheme formed by any combination of the above technical features. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and such improvements and modifications are also considered to be within the scope of the present invention.

Claims

1. A system for monitoring system availability based on statistical methods, comprising: the inter-system service call log module, the alarm threshold analysis module, the alarm analysis module and the monitoring alarm display module;

i. calculating the mean u and the standard deviation std of the current sample set;

if n1 = 0 or (n - n2) > 30 or (n - n2) > n/3, finishing the removal of abnormal points and continuing with the following steps; otherwise, returning to step i;

calculating the error-count alarm line of the system as alertNum = u + 3 × std;

the abnormality determination method includes:

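f) computing

$$z = \frac{k - num \cdot p_0}{\sqrt{num \cdot p_0 \cdot (1 - p_0)}}$$

where k is the corrected error count, num is the total number of calls and p_0 denotes the third threshold; if z > the sixth threshold, return abnormal; otherwise return normal;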

and the monitoring alarm display module is used for displaying the alarm data on the system topological graph after the alarm data is analyzed based on the system topological graph.

2. The system for monitoring system availability based on statistical methods of claim 1, wherein: the alarm threshold analysis module is further configured to set an alarm threshold and, after the error-count alarm line of the system is calculated, to set alertNum to the alarm threshold if alertNum < the alarm threshold.

3. The system for monitoring system availability based on statistical methods of claim 1, wherein: the monitoring alarm display module is also used for:

(1) when the system is abnormal, adding a warning mark on a system icon:

(2) when the service and the instance of the system are abnormal, clicking a system icon, and displaying error information by a popup layer;

(3) and when the number of the intersystem calling errors is not 0, drawing a connecting line and a directional arrow.

4. The system for monitoring system availability based on statistical methods of claim 3, wherein: the width of the connection line is related to the number of errors.

5. A method for monitoring system availability based on a statistical method is characterized by comprising the following steps:

i. calculating the mean u and the standard deviation std of the current sample set;

calculating the error-count alarm line of the system as alertNum = u + 3 × std;

the abnormality determination method includes:

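f) computing

$$z = \frac{k - num \cdot p_0}{\sqrt{num \cdot p_0 \cdot (1 - p_0)}}$$

where k is the corrected error count, num is the total number of calls and p_0 denotes the third threshold; if z > the sixth threshold, return abnormal; otherwise return normal;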

6. The method of claim 5, wherein step one further comprises:

an alarm threshold is set, and after the error-count alarm line of the system is calculated, alertNum is set to the alarm threshold if alertNum < the alarm threshold.

7. The method for statistically monitoring system availability of claim 5, wherein step four further comprises the steps of:

(1) when the system is abnormal, adding a warning mark on a system icon:

8. The method of statistically monitoring system availability of claim 7, wherein: the width of the connection line is related to the number of errors.