CN108599977B - System and method for monitoring system availability based on statistical method - Google Patents

System and method for monitoring system availability based on statistical method

Info

Publication number
CN108599977B
Authority
CN
China
Prior art keywords
abnormal
alarm
threshold
fnum
calling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810150782.5A
Other languages
Chinese (zh)
Other versions
CN108599977A (en
Inventor
梅存兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Tuniu Technology Co ltd
Original Assignee
Nanjing Tuniu Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Tuniu Technology Co ltd
Priority to CN201810150782.5A
Publication of CN108599977A
Application granted
Publication of CN108599977B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H04L41/065 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis involving logical or physical relationship, e.g. grouping and hierarchies
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0681 Configuration of triggering conditions
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/069 Management of faults, events, alarms or notifications using logs of notifications; Post-processing of notifications
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/22 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks comprising specially adapted graphical user interfaces [GUI]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/0805 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability
    • H04L43/0817 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability by checking functioning
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/16 Threshold monitoring

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Environmental & Geological Engineering (AREA)
  • Telephonic Communication Services (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention provides a system and a method for monitoring system availability based on a statistical method. The system comprises an inter-system service call log module, an alarm threshold analysis module, an alarm analysis module and a monitoring alarm display module. By collecting call logs between systems, historical data is periodically analyzed and learned to obtain the typical behavior of each system; the data in the most recent unit time t is analyzed to determine whether the current error count of each system is abnormal, whether the error rate of inter-system calls is abnormal, and whether the availability of each instance of each service of the system is abnormal; and abnormal systems and abnormal inter-system call relations are marked as alarms on the system topology graph. When alarm information is displayed, the system state, the inter-system call state, the system service state and the instance state are shown on the system topology graph, so that the faulty system can be located quickly when problems occur across many systems.

Description

System and method for monitoring system availability based on statistical method
Technical Field
The invention belongs to the technical field of software system monitoring, and relates to a system and a method for monitoring system availability based on a statistical method.
Background
Internet enterprises generally run a large number of application systems. Besides externally facing websites, apps and the like, many internal application systems support the operation and management of the enterprise. The internal application systems usually have relatively complex calling relationships, and a function that one system provides for another system to call is called a service. The industry generally monitors application system availability in the following ways:
Method one: use tools such as Zabbix to monitor certain metrics of the system servers, such as the number of web system processes/threads, CPU load, available memory, the number of abnormal HTTP status codes, and request response time. An alarm is raised when a metric exceeds a set threshold.
Method two: a simulated client makes periodic calls and checks whether metrics such as the content and speed of the server-side system's response meet the set thresholds. An alarm is raised when a metric exceeds a set threshold.
However, these existing monitoring approaches have several shortcomings:
1. The thresholds in methods one and two must be set manually. Thresholds differ between systems and, for the same system, between periods, so setting and maintaining them takes a great deal of work. In practice a trial-and-error approach is generally used: the threshold is loosened after a false alarm and tightened after a missed alarm, so both the false alarm rate and the missed alarm rate are high.
2. The monitoring in method one reflects availability only partially and cannot serve as an actual availability metric: a detected anomaly does not necessarily mean that system availability has dropped, and an unavailable system is not necessarily reflected in the monitored metrics.
3. The monitoring in method two reflects availability directly, but as a sampling check it has few samples and narrow coverage; it can generally monitor only read operations and is rarely used for write operations.
4. As the system grows more complex, both methods produce too many metrics, a large number of alarms and heavy alarm noise, which hampers diagnosing and locating problems.
5. When a new system or a new service goes online, or when system or service deployment changes, both methods require monitoring items to be maintained manually, which makes them unsuitable for systems with automatic failover and dynamic service scaling.
6. For error rate alarms, the threshold method often raises false alarms. For example, if the error rate is required not to exceed 1%, an alarm is raised when only one operation occurs and it fails (an error rate of 100%), although in most cases no alarm is needed.
7. When several systems in a complex cluster fail at the same time, it is difficult to quickly locate the system that is actually at fault; everything has to be investigated indiscriminately, which wastes precious time.
Disclosure of Invention
To solve the above problems, the invention provides a system and a method for monitoring system availability based on a statistical method. By collecting call logs between systems, historical data is periodically analyzed and learned to obtain the typical behavior of each system; the data in the most recent unit time t is analyzed to determine whether the current error count of each system is abnormal, whether the error rate of inter-system calls is abnormal, and whether the availability of each instance of each service of the system is abnormal; and abnormal systems and abnormal inter-system call relations are marked as alarms on the system topology graph.
To achieve this purpose, the invention provides the following technical solution:
A system for monitoring system availability based on a statistical method comprises an inter-system service call log module, an alarm threshold analysis module, an alarm analysis module and a monitoring alarm display module.
The inter-system service call log module collects and records log information for all calls between systems, including the call time, the IP and port number of the caller, the IP and port number of the callee, the identifier of the called service, and whether the call succeeded.
The alarm threshold analysis module periodically learns from historical data to find the typical behavior of each system. It obtains n samples for the current system, each sample describing the error count in unit time t, and removes outliers from the sample set. The outlier removal process comprises:
i. calculating the mean u and the standard deviation std of the current sample set;
ii. finding all sample points in the sample set that are greater than u + 3 × std, counting them as n1, removing them from the sample set, and counting the size n2 of the new sample set after removal;
iii. if the condition is met, outlier removal is complete and the following steps are executed; otherwise, step i is executed again;
calculating the alarm line of the system as alertNum = u + 3 × std.
The alarm analysis module periodically collects the logs of the most recent time period t and analyzes in turn whether the error count of each system is abnormal, whether the error rate of each service of each system is abnormal, whether the error rate of each instance is abnormal, and the error count between any two systems. Specifically, it traverses the system list and makes the following judgments:
a) if the cumulative error count of the system is greater than the alarm threshold of the system, the system is marked as abnormal;
b) each service of the system is traversed, and the abnormality determination method is applied to judge whether its error rate is abnormal;
c) each instance of the system is traversed, and the abnormality determination method is applied to judge whether its error rate is abnormal.
The abnormality determination method comprises:
a) denoting the number of successful calls as tNum, the number of failed calls as fNum, and the total number of calls as num = tNum + fNum;
b) if fNum < the first threshold, returning normal; otherwise, going to the next step;
c) if num >= the second threshold, going to the next step; otherwise, returning normal when fNum = the first threshold and returning abnormal otherwise;
d) if fNum/num < the third threshold, returning normal; otherwise, going to the next step;
e) when fNum < tNum, k = fNum + the fourth threshold; otherwise k = fNum - the fifth threshold;
f) computing
$$z = \frac{k - num \times p_0}{\sqrt{num \times p_0 \times (1 - p_0)}}$$
where $p_0$ denotes the third threshold; if z > the sixth threshold, returning abnormal; otherwise returning normal.
After the data is collated, the error count of each client instance (clientInstance) calling each server instance is calculated.
The client system and the server system corresponding to each client instance and server instance are looked up from the system topology graph, and the cumulative error count of each client system calling each server system is counted.
The monitoring alarm display module, based on the system topology graph, displays the alarm data on the system topology graph once the alarm data has been analyzed.
Further, the alarm threshold analysis module is also configured to set an alarm threshold: after the error count alarm line of the system is calculated, if alertNum < the alarm threshold, alertNum is set to the alarm threshold.
Further, the condition in the outlier removal process is:
n1 = 0, or (n - n2) > 30, or (n - n2) > n/3.
Further, the monitoring alarm display module is also configured to:
1. when a system is abnormal, add a warning mark to the system icon;
2. when a service or an instance of a system is abnormal, display the error information in a popup layer when the system icon is clicked;
3. when the number of inter-system call errors is not 0, draw a connecting line with a directional arrow.
Further, the width of the connecting line is related to the error count.
A method for monitoring system availability based on a statistical method comprises the following steps:
Step one: collecting and recording log information for all calls between systems, the log information including the call time, the IP and port number of the caller, the IP and port number of the callee, the identifier of the called service, and whether the call succeeded;
Step two: periodically learning from historical data to find the typical behavior of each system, obtaining n samples for the current system, each sample describing the error count in unit time t, and removing outliers from the sample set, the outlier removal process comprising:
i. calculating the mean u and the standard deviation std of the current sample set;
ii. finding all sample points in the sample set that are greater than u + 3 × std, counting them as n1, removing them from the sample set, and counting the size n2 of the new sample set after removal;
iii. if the condition is met, outlier removal is complete and the following steps are executed; otherwise, step i is executed again;
calculating the alarm line of the system as alertNum = u + 3 × std;
Step three: periodically collecting the logs of the most recent time period t and analyzing in turn whether the error count of each system is abnormal, whether the error rate of each service of each system is abnormal, whether the error rate of each instance is abnormal, and the error count between any two systems; specifically, traversing the system list and making the following judgments:
a) if the cumulative error count of the system is greater than the alarm threshold of the system, the system is marked as abnormal;
b) each service of the system is traversed, and the abnormality determination method is applied to judge whether its error rate is abnormal;
c) each instance of the system is traversed, and the abnormality determination method is applied to judge whether its error rate is abnormal;
the abnormality determination method comprising:
a) denoting the number of successful calls as tNum, the number of failed calls as fNum, and the total number of calls as num = tNum + fNum;
b) if fNum < the first threshold, returning normal; otherwise, going to the next step;
c) if num >= the second threshold, going to the next step; otherwise, returning normal when fNum = the first threshold and returning abnormal otherwise;
d) if fNum/num < the third threshold, returning normal; otherwise, going to the next step;
e) when fNum < tNum, k = fNum + the fourth threshold; otherwise k = fNum - the fifth threshold;
f) computing
$$z = \frac{k - num \times p_0}{\sqrt{num \times p_0 \times (1 - p_0)}}$$
where $p_0$ denotes the third threshold; if z > the sixth threshold, returning abnormal; otherwise returning normal;
after the data is collated, calculating the error count of each client instance (clientInstance) calling each server instance;
looking up, from the system topology graph, the client system and the server system corresponding to each client instance and server instance, and counting the cumulative error count of each client system calling each server system;
Step four: based on the system topology graph, displaying the alarm data on the system topology graph once the alarm data has been analyzed.
Further, the first step further comprises:
setting an alarm threshold, and after the error count alarm line of the system is calculated, if alertNum < the alarm threshold, setting alertNum to the alarm threshold.
Further, the condition in the outlier removal process in step two is:
n1 = 0, or (n - n2) > 30, or (n - n2) > n/3.
Further, step four further comprises:
1. when a system is abnormal, adding a warning mark to the system icon;
2. when a service or an instance of a system is abnormal, displaying the error information in a popup layer when the system icon is clicked;
3. when the number of inter-system call errors is not 0, drawing a connecting line with a directional arrow.
Further, the width of the connecting line is related to the error count.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. By analyzing the service call logs between instances, the invention can monitor whether systems, system services and system instances are abnormal, and it displays the alarm information in combination with the system topology graph. When alarm information is displayed, the system state, the inter-system call state, the system service state and the instance state are shown on the system topology graph, so that the faulty system can be located quickly when problems occur across many systems.
2. The invention obtains the alarm threshold by analyzing the normal behavior of the system over a past period; during alarm analysis, an alarm is raised when the error count exceeds the threshold. For count-type alarms, this provides a method of setting the alarm threshold automatically, which reduces manual work, improves alarm accuracy, and greatly reduces false alarms and missed alarms. After a new system has run online for a period of time, the invention can automatically set an alarm threshold for it.
3. The invention can analyze whether the services and instances of a system are abnormal; for rate-type alarms, it improves alarm accuracy and reduces false alarms and missed alarms.
4. The monitoring method uses actual call data as its samples, so its coverage is more comprehensive than that of periodic sampling checks.
Drawings
FIG. 1 is a schematic diagram of a normal distribution.
Fig. 2 is an exemplary diagram of a log format.
FIG. 3 is a flow chart of alarm threshold analysis.
Fig. 4 is a diagram showing an error count of a sample in the system obtained by calling the Logstash interface.
FIG. 5 is a diagram of inter-instance call data.
FIG. 6 is a system topology diagram with warning flags added.
FIG. 7 is a system topology diagram showing error information in a popup layer.
Detailed Description
The technical solutions provided by the present invention will be described in detail below with reference to specific examples, and it should be understood that the following specific embodiments are only illustrative of the present invention and are not intended to limit the scope of the present invention. Additionally, the steps illustrated in the flow charts of the figures may be performed in a computer system such as a set of computer-executable instructions and, although a logical order is illustrated in the flow charts, in some cases, the steps illustrated or described may be performed in an order different than here.
The error count of a system in unit time t can be regarded as being influenced by many independent random factors, each of which generally has a small effect, so the error count can be studied as a random variable that follows a normal distribution. The density function of a normal distribution is:
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
where $\mu$ is the mean and $\sigma$ is the standard deviation (written u and std below).
By collecting data on the typical behavior of the system over a past period, the mean u and the standard deviation std of the error count in unit time t can be calculated. Let the error count in the most recent unit time t be denoted failNum. As shown in Fig. 1, it is easy to calculate that the probability P(failNum >= u + 3 × std) is much less than 0.01, i.e. it is an event of extremely small probability. Therefore, when the error count observed in the most recent unit time t exceeds the mean plus three standard deviations, this is an extreme situation that needs manual attention, and alarm information should be sent.
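As a quick check of the three-standard-deviation rule, the tail probability can be evaluated numerically. The following is a minimal sketch that assumes the error count is exactly normally distributed; the helper name `normal_upper_tail` is illustrative and does not come from the patent.

```python
import math


def normal_upper_tail(k: float) -> float:
    """Return P(X >= u + k * std) for a normally distributed X.

    The result does not depend on u or std, only on the multiple k.
    """
    return 0.5 * math.erfc(k / math.sqrt(2.0))


p = normal_upper_tail(3.0)
print(f"P(failNum >= u + 3*std) = {p:.5f}")  # prints 0.00135
```

Under this assumption the tail probability is roughly 0.1%, which is consistent with treating such an observation as an extremely rare event.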
When the error rate of a system or service is studied by observing its error count, it is easy to see that even if the highest acceptable error rate p0 is 0.01, observing more than 1 failure in 100 actual calls does not indicate a problem in the system, because that is an event with a high probability of occurrence.
When the number of calls is small (here we take this to mean fewer than 40), we calculate the probability p1 that, although the error rate p of the system is not higher than p0, the error count failNum observed in n calls is greater than failLevel:
$$p_1 = P(failNum > failLevel) = \sum_{i=failLevel+1}^{n} \binom{n}{i}\, p_0^{\,i}\,(1-p_0)^{\,n-i}$$
We call events with a probability of occurrence below 0.05 small-probability events. In a small, limited number of trials a small-probability event should not occur; that is, when one does occur, we can no longer assume that p is not higher than p0 and should instead conclude that p is higher than p0, the system error rate is too high, and an alarm should be sent. By numerical calculation with p0 = 0.01, we find the critical values of failLevel that make p1 < 0.05: the critical failLevel is 0 when n <= 5, 1 when 5 < n <= 35, and 2 when 35 < n < 40. In other words, when n calls are observed and the error count is higher than the corresponding failLevel, a small-probability event is considered to have occurred and attention is needed; otherwise the system is considered normal. For convenience, we set failLevel uniformly to 1 for n < 40; the resulting error is within an acceptable range in practice.
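The critical values of failLevel can be reproduced with a short numerical search. The sketch below assumes failNum follows a binomial distribution with n trials and error rate p0 = 0.01; the function names `tail_probability` and `critical_fail_level` are illustrative, not part of the patent.

```python
from math import comb


def tail_probability(n: int, fail_level: int, p0: float = 0.01) -> float:
    """P(failNum > fail_level) when failNum ~ Binomial(n, p0)."""
    head = sum(
        comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(fail_level + 1)
    )
    return 1.0 - head


def critical_fail_level(n: int, p0: float = 0.01, alpha: float = 0.05) -> int:
    """Smallest fail_level whose tail probability drops below alpha."""
    level = 0
    while tail_probability(n, level, p0) >= alpha:
        level += 1
    return level


for n in (5, 6, 35, 36):
    print(n, critical_fail_level(n))
```

Scanning n from 1 to 39 with this helper shows where the critical value steps from 0 to 1 and from 1 to 2, which is the basis for the uniform simplification failLevel = 1.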
When the number of calls is large (we take this to mean no fewer than 40), let the observed error rate be p1 and the true error rate of the system be p, which is normally no higher than p0. By the central limit theorem, p1 approximately follows a normal distribution with mean p and variance p(1-p)/n, i.e. the statistic
$$\frac{p_1 - p}{\sqrt{p(1-p)/n}}$$
follows a standard normal distribution. When p <= p0, taking the boundary case p = p0,
$$z = \frac{p_1 - p_0}{\sqrt{p_0(1-p_0)/n}}$$
approximately follows a standard normal distribution. According to the standard normal quantile table, the event
$$z > 1.645$$
has a probability below 0.05, so it is a small-probability event that warrants attention and an alarm. For convenience of application, we multiply the numerator and denominator of
$$z = \frac{p_1 - p_0}{\sqrt{p_0(1-p_0)/n}}$$
by n and rewrite it as:
$$z = \frac{n\,p_1 - n\,p_0}{\sqrt{n\,p_0(1-p_0)}}$$
where n × p1 is the number of errors actually observed.
The corresponding abnormality determination method comprises the following steps:
a) denote the number of successful calls as tNum, the number of failed calls as fNum, and the total number of calls as num = tNum + fNum;
b) if fNum < 1, return normal; otherwise, go to the next step;
c) if num >= 40, go to the next step; otherwise, return normal when fNum = 1 and return abnormal otherwise;
d) if fNum/num < 0.01, return normal; otherwise, go to the next step;
e) when fNum < tNum, k = fNum + 0.5; otherwise k = fNum - 0.5 (because the distribution is only approximately normal, this correction brings the statistic closer to a normal distribution);
f) compute
$$z = \frac{k - num \times p_0}{\sqrt{num \times p_0 \times (1 - p_0)}}, \quad p_0 = 0.01$$
If z > 1.645, return abnormal; otherwise return normal.
Each value used in the abnormality determination method can be adjusted as needed.
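Steps a) to f) can be sketched as a single function. This is a minimal illustration, not the patented implementation: the function and parameter names are hypothetical, and the z statistic is assumed to be (k - num × p0) / sqrt(num × p0 × (1 - p0)), which is how the derivation above reads.

```python
import math


def is_abnormal(t_num: int, f_num: int, min_errors: int = 1,
                large_sample: int = 40, p0: float = 0.01,
                correction: float = 0.5, z_critical: float = 1.645) -> bool:
    """Return True if the observed failures look abnormal.

    t_num is the number of successful calls, f_num the number of failed calls.
    The default values mirror the figures quoted in the text and can be tuned.
    """
    num = t_num + f_num                       # step a)
    if f_num < min_errors:                    # step b)
        return False
    if num < large_sample:                    # step c), small-sample branch
        return f_num > min_errors
    if f_num / num < p0:                      # step d)
        return False
    # step e): correction of +/- 0.5 as described in the text
    k = f_num + correction if f_num < t_num else f_num - correction
    # step f): assumed form of the z statistic
    z = (k - num * p0) / math.sqrt(num * p0 * (1 - p0))
    return z > z_critical


print(is_abnormal(9, 1))     # one failure in 10 calls
print(is_abnormal(8, 2))     # two failures in 10 calls
print(is_abnormal(980, 20))  # 2% failures in 1000 calls
```

Keeping every threshold as a keyword argument reflects the statement that each value in the method can be adjusted as needed.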
The invention provides a system for monitoring system availability based on a statistical method, which comprises an inter-system service call log module, an alarm threshold analysis module, an alarm analysis module and a monitoring alarm display module. The service list and the instance list of each system can be obtained directly from the system topology graph. The service list, the instance list, and the logs related to services and instances are described in more detail in the invention patent application No. 2017109039551, filed by the same company and entitled "System deployment and dependency relationship automatic drawing system and method".
The inter-system service call log module collects and records log information for all calls between systems. Specifically, a particular deployment of a system (Application) on a server is called an instance (Instance), which is uniquely identified by the IP of the server where it runs and the port number it occupies. After one instance calls a service of another instance, the caller records a call log, which includes: the call time (startTime), the caller IP (consumerIp) and port number (consumerPort), the callee IP (serviceIp) and port number (servicePort), the called service identifier (serviceName), and whether the call succeeded (success). The inter-system service call log module stores the logs using the open source tool Logstash and can save the data within 2 seconds after the call completes. The stored log is shown in Fig. 2.
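For illustration, one call log entry could be modeled as follows. The field names are taken from the description above; the field types, the "ip:port" instance key and the sample values are assumptions made for this sketch only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallLog:
    """One inter-system call log entry (field names follow the log format)."""
    startTime: str       # call time; storing it as a string is an assumption
    consumerIp: str      # caller IP
    consumerPort: int    # caller port
    serviceIp: str       # callee IP
    servicePort: int     # callee port
    serviceName: str     # identifier of the called service
    success: bool        # whether the call succeeded

    @property
    def client_instance(self) -> str:
        # An instance is identified by IP plus port; "ip:port" is an assumed encoding.
        return f"{self.consumerIp}:{self.consumerPort}"

    @property
    def server_instance(self) -> str:
        return f"{self.serviceIp}:{self.servicePort}"


# Made-up sample values, for demonstration only.
entry = CallLog("2018-02-13T10:00:00", "10.0.0.1", 8080,
                "10.0.0.2", 9090, "orderQuery", False)
print(entry.client_instance, "->", entry.server_instance)
```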
The alarm threshold analysis module periodically learns from historical data to find the typical behavior of each system. The specific implementation process is shown in Fig. 3 and comprises the following steps:
1. Traverse the system list:
a) obtain the list of all services of the current system;
b) call the Logstash interface to fetch the cumulative error count of all services of the system over the most recent n × t time range, and divide it into n parts in units of t, i.e. obtain n samples of the current system, each describing the error count in unit time t, as shown in Fig. 4;
c) remove outliers from the sample set:
i. calculate the mean u and the standard deviation std of the current sample set;
ii. find all sample points in the sample set that are greater than u + 3 × std, count them as n1, remove them from the sample set, and count the size n2 of the new sample set after removal; this step finds and removes historical abnormal situations so that they do not distort the evaluation of the system's typical behavior;
iii. if n1 = 0, or (n - n2) > 30, or (n - n2) > n/3, outlier removal is complete and step d) is executed; otherwise, step i is executed again;
d) calculate the mean u and the standard deviation std of the new sample set;
e) calculate the alarm line of the system as alertNum = u + 3 × std; if alertNum < 100, set alertNum to 100.
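The threshold learning loop above can be sketched as follows. The sketch assumes the population standard deviation (the text does not say which estimator is used) and takes the samples as a plain list of error counts; the function name and parameters are hypothetical.

```python
import statistics


def alert_threshold(samples, floor=100, max_removed=30):
    """Learn the error-count alarm line from n historical samples.

    Repeatedly drops points above u + 3 * std until no point is dropped,
    or more than max_removed points, or more than a third of the original
    samples, have been removed. Returns max(u + 3 * std, floor).
    """
    if not samples:
        raise ValueError("at least one sample is required")
    n = len(samples)
    current = list(samples)
    while True:
        u = statistics.fmean(current)
        std = statistics.pstdev(current)   # population std: an assumption
        kept = [x for x in current if x <= u + 3 * std]
        n1 = len(current) - len(kept)      # points removed in this pass
        current = kept
        n2 = len(current)                  # size of the new sample set
        if n1 == 0 or (n - n2) > max_removed or (n - n2) > n / 3:
            break
    u = statistics.fmean(current)
    std = statistics.pstdev(current)
    return max(u + 3 * std, floor)


history = [10] * 50 + [1000]               # one historical incident
print(alert_threshold(history))            # floor of 100 applies
print(alert_threshold(history, floor=0))   # learned line without the floor
```

In this made-up history the single spike of 1000 is removed in the first pass, so it no longer inflates the learned alarm line.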
The alarm analysis module regularly collects logs (for example, the collection is performed once per minute, and the collection interval can be adjusted as required) in the latest t time period, and successively analyzes whether the error number of each system is abnormal, whether each service error rate of each system is abnormal, whether the instance error rate is abnormal, and the error number between any two systems, and the specific method is as follows:
1. extracting the log of the latest time t from the Logstash, namely exhausting the relation of successful number and failure number of calling b instance c service by all existing a instances, as shown in fig. 5, recording the relation as data:
2. sorting the data, merging the field consumerIp and the field consumerPort into a field clientInstance, and merging the field serviceIp and the field servicePort into a field serviceInstance;
3. sorting data, and calculating the accumulated correct number and the accumulated error number of each server instance and each serviceName;
4. traversing the system list:
a) counting the accumulated error number of each system, namely the sum of the error numbers of all services under the system;
b) if the accumulated error number of the system is larger than the alarm threshold value alert num of the system, marking the system as abnormal;
c) traversing each service (serviceName) of the system, applying the abnormity judgment method, and substituting the accumulative correct number and the accumulative error number of the serviceName to judge whether the serviceName is abnormal or not;
d) traversing each instance (server instance) of the system, applying the above-mentioned abnormality judgment method, and substituting the accumulated correct number and the accumulated error number of the server instance to judge whether it is abnormal;
5. organizing the data, and calculating the error number of each pair of a clientInstance calling a server instance;
looking up in the system topological graph the client system and the server system to which each client instance and server instance belongs, and counting the accumulated error number of each pair of a client system calling a server system.
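The aggregation in steps 2 to 5 can be sketched as follows. The record layout is an assumption made for illustration: each record is taken to be a dict with the fields consumerIp, consumerPort, serviceIp, servicePort and serviceName named in the text, plus hypothetical `success` and `failure` counts. The mapping `instance_to_system` stands in for the lookup in the system topological graph, and errors are attributed to the called system.

```python
from collections import Counter, defaultdict


def aggregate(records, instance_to_system):
    """Accumulate correct/error counts per service, instance, and system pair."""
    per_service = defaultdict(lambda: [0, 0])   # (system, serviceName) -> [ok, err]
    per_instance = defaultdict(lambda: [0, 0])  # server instance -> [ok, err]
    instance_pair_errors = Counter()            # (client inst, server inst) -> err
    system_pair_errors = Counter()              # (client sys, server sys) -> err
    system_errors = Counter()                   # called system -> err

    for r in records:
        # step 2: merge IP and port into one instance identifier
        client = f"{r['consumerIp']}:{r['consumerPort']}"
        server = f"{r['serviceIp']}:{r['servicePort']}"
        ok, err = r["success"], r["failure"]
        # look up the owning systems (stand-in for the topology graph)
        client_sys = instance_to_system[client]
        server_sys = instance_to_system[server]
        # step 3: accumulated counts per service and per server instance
        for totals in (per_service[(server_sys, r["serviceName"])],
                       per_instance[server]):
            totals[0] += ok
            totals[1] += err
        # step 4 a): accumulated error number of the called system
        system_errors[server_sys] += err
        # step 5: errors per instance pair and per system pair
        instance_pair_errors[(client, server)] += err
        system_pair_errors[(client_sys, server_sys)] += err

    return {
        "per_service": dict(per_service),
        "per_instance": dict(per_instance),
        "instance_pair_errors": instance_pair_errors,
        "system_pair_errors": system_pair_errors,
        "system_errors": system_errors,
    }
```

The per-system totals can then be compared with each system's alarm threshold, and the per-service and per-instance totals passed to the abnormality judgment method.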
The monitoring alarm display module is based on the system topological graph and displays the alarm data on the system topological graph after the alarm data has been analyzed.
1. When the system is abnormal, adding a warning mark on a system icon, as shown in FIG. 6;
2. when the service and the instance of the system are abnormal, clicking a system icon, and displaying error information by a popup layer, as shown in FIG. 7;
3. when the number of inter-system call errors is not 0, drawing a connecting line and a directional arrow, wherein the width of the line is the logarithm of the number of errors. Other conventional formulas that take the error number as input may also be used to calculate the line width; any scheme in which the line width or color is correlated with the error number meets the requirements of the present invention.
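A minimal sketch of the line width rule follows. The function name `line_width`, the natural logarithm base, and the minimum width of 1.0 (added so that a single error still produces a visible line) are assumptions; the text only states that the width is the logarithm of the number of errors.

```python
import math


def line_width(error_count, min_width=1.0):
    """Return the width of the call line, or None when no line is drawn."""
    if error_count == 0:
        return None  # no errors between the two systems: no connecting line
    # logarithmic scaling keeps very large error counts from dominating the graph
    return max(min_width, math.log(error_count))
```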
When a failure occurs, the figure makes it easy to see which systems have problems, which systems are affected, and which instances and services of a system have problems.
The invention also provides a method for monitoring system availability based on a statistical method, which comprises an inter-system service call log step, an alarm threshold analysis step, an alarm analysis step and a monitoring alarm display step; the inter-system service call log step performs what is implemented by the inter-system service call log module, the alarm threshold analysis step performs what is implemented by the alarm threshold analysis module, the alarm analysis step performs what is implemented by the alarm analysis module, and the monitoring alarm display step performs what is implemented by the monitoring alarm display module.
The technical means disclosed in the invention scheme are not limited to the technical means disclosed in the above embodiments, but also include the technical scheme formed by any combination of the above technical features. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and such improvements and modifications are also considered to be within the scope of the present invention.

Claims (8)

1. A system for monitoring system availability based on statistical methods, comprising: the inter-system service call log module, the alarm threshold analysis module, the alarm analysis module and the monitoring alarm display module;
the inter-system service call log module is used for collecting and recording log information of all calls between systems, including the call time, the IP and port number of the calling party, the IP and port number of the called party, the identifier of the called service, and whether the call succeeded or failed;
the alarm threshold analysis module is used for regularly learning historical data, finding out the performance of each system under the general situation, obtaining n samples of the current system, wherein each sample describes the error number in unit time t, and removing abnormal points in a sample set, and the process of removing the abnormal points comprises the following steps:
i. calculating the mean u of the current sample set (Figure FDA0002959669670000011) and the standard deviation std (Figure FDA0002959669670000012);
finding all sample points in the sample set that are greater than u + 3 × std, counting them as n1, removing them from the sample set, and counting the number n2 of samples in the new sample set after removal;
if n1 = 0, or (n - n2) > 30, or (n - n2) > n/3, the removal of the abnormal points is complete and the following steps are executed; otherwise, step i is executed again;
calculating the alarm line of the system as alert num = u + 3 × std;
the alarm analysis module is used for regularly acquiring logs in the latest t time period, successively analyzing whether the error number of each system is abnormal, whether each service error rate of each system is abnormal, whether the instance error rate is abnormal and the error number condition between any two systems, and specifically judging the following conditions after traversing the system list:
a) if the accumulated error number of the system is larger than the alarm threshold value of the system, the system is marked to be abnormal;
b) traversing each service of the system, applying an abnormal judgment method to judge whether the error rate is abnormal or not;
c) traversing each instance of the system, and applying an abnormality judgment method to judge whether the error rate of the system is abnormal or not;
the abnormality determination method includes:
a) recording the number of correct calls as tNum and the number of failed calls as fNum, the total number of calls being num = tNum + fNum;
b) if fNum < the first threshold, returning normal; otherwise, proceeding to the next step;
c) if num >= the second threshold, proceeding to the next step; otherwise, returning normal when fNum = the first threshold and returning abnormal otherwise;
d) if fNum/num < the third threshold, returning normal; otherwise, proceeding to the next step;
e) when fNum < tNum, k = fNum + the fourth threshold; otherwise, k = fNum - the fifth threshold;
f) computing z (Figure FDA0002959669670000013);
if z > the sixth threshold, returning abnormal; otherwise, returning normal;
after finishing the data, calculating the error number of each group of clientInstances calling the server Instances;
the system client and the server corresponding to the client instance and the server instance are reversely checked from the system topological graph, and the accumulated error number of each group of client system calling the server system is counted;
and the monitoring alarm display module is used for displaying the alarm data on the system topological graph after the alarm data is analyzed based on the system topological graph.
2. The system for monitoring system availability based on statistical methods of claim 1, wherein: the alarm threshold analysis module is further configured to set an alarm threshold, and after calculating the error number alarm line of the system, set alert num as the alarm threshold if alert num < alarm threshold.
3. The system for monitoring system availability based on statistical methods of claim 1, wherein: the monitoring alarm display module is also used for:
(1) when the system is abnormal, adding a warning mark on a system icon:
(2) when the service and the instance of the system are abnormal, clicking a system icon, and displaying error information by a popup layer;
(3) and when the number of the intersystem calling errors is not 0, drawing a connecting line and a directional arrow.
4. The system for monitoring system availability based on statistical methods of claim 3, wherein: the width of the connection line is related to the number of errors.
5. A method for monitoring system availability based on a statistical method is characterized by comprising the following steps:
step one, acquiring and recording log information of all calls between systems, wherein the log information comprises the call time, the IP and port number of the calling party, the IP and port number of the called party, the identifier of the called service, and whether the call succeeded or failed;
step two, regularly learning the historical data, finding out the performance of each system under the general situation, obtaining n samples of the current system, wherein each sample describes the error number in unit time t, and removing abnormal points in a sample set, and the process of removing the abnormal points comprises the following steps:
i. calculating the mean u of the current sample set (Figure FDA0002959669670000021) and the standard deviation std (Figure FDA0002959669670000022);
finding all sample points in the sample set that are greater than u + 3 × std, counting them as n1, removing them from the sample set, and counting the number n2 of samples in the new sample set after removal;
if n1 = 0, or (n - n2) > 30, or (n - n2) > n/3, the removal of the abnormal points is complete and the following steps are executed; otherwise, step i is executed again;
calculating the alarm line of the system as alert num = u + 3 × std;
step three, regularly collecting logs in the latest time period t, successively analyzing whether the error number of each system is abnormal, whether each service error rate of each system is abnormal, whether the instance error rate is abnormal and the error number condition between any two systems, and specifically judging the following conditions after traversing the system list:
a) if the accumulated error number of the system is larger than the alarm threshold value of the system, the system is marked to be abnormal;
b) traversing each service of the system, applying an abnormal judgment method to judge whether the error rate is abnormal or not;
c) traversing each instance of the system, and applying an abnormality judgment method to judge whether the error rate of the system is abnormal or not;
the abnormality determination method includes:
a) recording the number of correct calls as tNum and the number of failed calls as fNum, the total number of calls being num = tNum + fNum;
b) if fNum < the first threshold, returning normal; otherwise, proceeding to the next step;
c) if num >= the second threshold, proceeding to the next step; otherwise, returning normal when fNum = the first threshold and returning abnormal otherwise;
d) if fNum/num < the third threshold, returning normal; otherwise, proceeding to the next step;
e) when fNum < tNum, k = fNum + the fourth threshold; otherwise, k = fNum - the fifth threshold;
f) computing z (Figure FDA0002959669670000031);
if z > the sixth threshold, returning abnormal; otherwise, returning normal;
after finishing the data, calculating the error number of each group of clientInstances calling the server Instances;
the system client and the server corresponding to the client instance and the server instance are reversely checked from the system topological graph, and the accumulated error number of each group of client system calling the server system is counted;
and step four, based on the system topological graph, displaying the alarm data on the system topological graph after the alarm data is analyzed.
6. The method of claim 5, wherein step one further comprises:
an alarm threshold is set, and after an error number alarm line of the system is calculated, alert num is set as the alarm threshold if alert num < alarm threshold.
7. The method for statistically monitoring system availability of claim 5, wherein step four further comprises the steps of:
(1) when the system is abnormal, adding a warning mark on a system icon:
(2) when the service and the instance of the system are abnormal, clicking a system icon, and displaying error information by a popup layer;
(3) and when the number of the intersystem calling errors is not 0, drawing a connecting line and a directional arrow.
8. The method of statistically monitoring system availability of claim 7, wherein: the width of the connection line is related to the number of errors.
CN201810150782.5A 2018-02-13 2018-02-13 System and method for monitoring system availability based on statistical method Active CN108599977B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810150782.5A CN108599977B (en) 2018-02-13 2018-02-13 System and method for monitoring system availability based on statistical method

Publications (2)

Publication Number Publication Date
CN108599977A CN108599977A (en) 2018-09-28
CN108599977B true CN108599977B (en) 2021-09-28

Family

ID=63608860

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810150782.5A Active CN108599977B (en) 2018-02-13 2018-02-13 System and method for monitoring system availability based on statistical method

Country Status (1)

Country Link
CN (1) CN108599977B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109617740A (en) * 2018-12-28 2019-04-12 广东亿迅科技有限公司 A kind of method and device that application failure quickly positions
CN110086682B (en) * 2019-05-22 2022-06-24 四川新网银行股份有限公司 Service link calling relation view and fault root cause positioning method based on TCP
CN111510351B (en) * 2020-04-10 2021-09-14 星辰天合(北京)数据科技有限公司 Anomaly detection method and device based on Promissuris monitoring system
TWI787781B (en) * 2021-04-09 2022-12-21 住華科技股份有限公司 Method and system for monitoring automatic optical inspecttion device
CN113962273B (en) * 2021-09-22 2022-03-18 北京必示科技有限公司 Multi-index-based time series anomaly detection method and system and storage medium
CN114500326B (en) * 2022-02-25 2023-08-11 北京百度网讯科技有限公司 Abnormality detection method, abnormality detection device, electronic device, and storage medium
CN115037636A (en) * 2022-06-06 2022-09-09 阿里云计算有限公司 Service quality perception method and device, electronic equipment and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102299897A (en) * 2010-06-23 2011-12-28 电子科技大学 Characteristic-association-based peer-to-peer networking characteristic analysis method

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101953558B1 (en) * 2012-10-23 2019-03-04 한국전자통신연구원 Apparatus and Method for Fault Management of Smart Devices
CN102932466B (en) * 2012-11-07 2015-09-23 网宿科技股份有限公司 The distributed source method for supervising of content-based distributing network and system
CN103514259B (en) * 2013-08-13 2017-04-26 华北电力大学 Abnormal data detection and modification method based on numerical value relevance model
CN106407082B (en) * 2016-09-30 2019-06-14 国家电网公司 A kind of information system alarm method and device
CN107612756A (en) * 2017-10-31 2018-01-19 广西宜州市联森网络科技有限公司 A kind of operation management system with intelligent trouble analyzing and processing function

Also Published As

Publication number Publication date
CN108599977A (en) 2018-09-28

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant